A year ago, I identified huge growth in cloud implementations to meet the demands of data modernization, cost, and security.[1] The adoption of multi-cloud[2] environments continues, and Gartner predicts that “more than 85% of organizations will embrace a cloud-first principle by 2025.”[3]
In 2023, we’ve seen a pronounced shift in the data science domain. While there have been numerous layoffs across technology, demand will continue to rise, with Data Engineering and MLOps taking precedence.[4] There is an even greater appetite for certified skillsets, both for on-premises and cloud-based technologies. Let’s look at the skill requirements that make sense for a data engineer now.
10. Scripting. Yes, skills in scripting are still required. Linux Bash, PowerShell, TypeScript, JavaScript, and Python are all still here, and if anything we’re dealing with even more data formats in the data pipeline (text-based and binary formats including CSV, TSV, JSON, XML, Avro, Parquet, ORC, etc.) that require additional knowledge of ETL / ELT techniques and tools. See more on this under Data Pipelines below.
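As a rough illustration of the kind of format juggling a script often handles, here is a minimal sketch that converts CSV text into JSON Lines using only the Python standard library. The function name `csv_to_json_lines` and the sample data are hypothetical, chosen only for this example.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines (one object per line)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row becomes a dict keyed by the header names; values stay strings.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data, used only for illustration.
sample_csv = "id,name\n1,Ada\n2,Grace\n"
json_lines = csv_to_json_lines(sample_csv)
print(json_lines)
```

Note that the CSV reader returns every value as a string; casting to numeric or date types would be a separate step in a real script.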
9. Programming. The move to cloud has changed the required languages little in the last year, with Java, C#, and C++ still important on-premises.[5] More prevalent cloud languages are centered around Go, Ruby, and Rust, and especially Python and Scala with the Apache Spark engine and cloud platforms such as AWS Glue, Snowflake, and Databricks. Working with real-time streaming data, such as social media, NLP, email, and controls, on cloud-based systems[6] is only going to increase in the coming years, as is AI.
8. DevOps. This foundational piece of the Data Engineer’s knowledge continues to be in demand. This area includes the Software Development Life Cycle (SDLC) and Continuous Integration (CI), Continuous Delivery (CD), and Continuous Deployment (CD) techniques and tools like Jenkins, Git, GitHub, and GitLab. The process, especially when tied into DataOps[7], Master Data Management (MDM), and Data Governance, results in better data quality practices and more accurate results.[8]
7. SQL. There has been significant growth in cloud-based systems adding SQL-like interfaces. A year ago we mentioned Google’s Looker and Amazon’s Athena and QuickSight combination. Now we have Snowflake and even Spark SQL on Databricks. Relational Database Management Systems (RDBMS) are still key to data discovery and reporting, no matter where they reside.
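The same SQL carries across most of these engines. As a minimal sketch, the aggregate query below runs against an in-memory SQLite database from the Python standard library; the `orders` table and its rows are hypothetical, and SQLite stands in here for whichever cloud engine is actually in use.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a cloud SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],  # hypothetical rows
)

# A typical reporting query: total amount per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

Dialects differ in functions and data types, so a query like this may need small adjustments when moved between engines.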
6. NoSQL. Google Bigtable, AWS EMR, and Azure File and Blob storage are all related and manage hierarchical file data, much like the open-source Hadoop ecosystem. The cloud is full of unstructured or semi-structured (lacking a SQL schema) data stores, in fact over 225.[9] NoSQL databases, whether open-source Apache-based or MongoDB and Cassandra, remain all the rage in 2023. Knowing how to manipulate key-value pairs and object formats like JSON, Avro, or Parquet is still a necessity, especially with AWS DynamoDB and in-memory caches like Redis still in the spotlight for caching performance.
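To make the key-value and caching idea concrete, here is a minimal sketch of a cache that stores JSON documents under string keys and only calls the slower backing store on a miss. A plain Python dict stands in for a cache such as Redis, and the `ProfileCache` class, the key format, and the `slow_store` loader are all hypothetical.

```python
import json


class ProfileCache:
    """Look in the key-value cache first; fall back to the loader on a miss."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}  # dict used as a stand-in for an external cache

    def get(self, user_id):
        key = f"profile:{user_id}"
        if key in self._cache:
            # Cache hit: values are stored as JSON strings, so decode them.
            return json.loads(self._cache[key])
        profile = self._loader(user_id)
        self._cache[key] = json.dumps(profile)
        return profile


store_reads = []  # records every call that reaches the backing store


def slow_store(user_id):
    """Hypothetical backing store lookup."""
    store_reads.append(user_id)
    return {"user_id": user_id, "plan": "free"}


profiles = ProfileCache(slow_store)
first = profiles.get(42)
second = profiles.get(42)  # served from the cache, no second store read
print(first, second, store_reads)
```

A real cache would also need expiry and invalidation when the underlying record changes, which this sketch leaves out.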
5. Data Pipelines. Operating with real-time streams, data warehouse queries, JSON, CSV, Parquet, ORC, and raw data is a daily occurrence. How and where data engineers set up storage may change which skillsets and tools are required for ETL / ELT ingestion. Disparate data lakes keep getting new names, like the Databricks Lakehouse and Snowflake’s Data Cloud. This is one area that is getting more complex and more varied depending on the source and resources used.
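The extract, transform, and load stages can be sketched in a few lines of Python. This is only an illustrative outline under simple assumptions: the input is a list of JSON strings, the field names `sensor` and `reading` are hypothetical, and a plain list stands in for the target data store.

```python
import json


def extract(lines):
    """Yield non-empty raw lines from the source."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def transform(lines, rejects):
    """Parse each line and normalize types; set aside lines that fail."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {"sensor": str(record["sensor"]), "reading": float(record["reading"])}
        except (ValueError, KeyError, TypeError):
            rejects.append(line)


def load(records, target):
    """Append records to the target and return how many were loaded."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


# Hypothetical raw input, including one malformed and one incomplete record.
raw = [
    '{"sensor": "a", "reading": "21.5"}',
    "not json",
    "",
    '{"sensor": "b"}',
    '{"sensor": "c", "reading": 19}',
]
rejects = []
warehouse = []  # list used as a stand-in for the target store
loaded = load(transform(extract(raw), rejects), warehouse)
print(loaded, rejects)
```

Keeping rejected records in a separate list, rather than dropping them silently, makes it possible to inspect and replay bad input later.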
4. Multi-cloud computing. 76% of enterprises[11] have already chosen more than one cloud provider, predominantly Microsoft and AWS. Cloud spending last year reached $482 billion.[12] A Data Engineer still needs a good understanding of the underlying technologies that make up cloud computing and, in particular, knowledge of IaaS, PaaS, and SaaS implementations.[13]
3. Hyper Automation. Gartner states that “the most successful hyper-automation teams focus on three key priorities: improving the quality of work, accelerating business processes and increasing decision-making agility.”[10] Value-added tasks, like running jobs, schedules, and events, are a data engineer’s skillset requirement. This trend is becoming more prominent, with specialized Scripting and Data Pipelines tasks required to successfully move data to the cloud.
2. Visualization. Working knowledge of tools like SSRS, Excel, Power BI, Tableau, AWS QuickSight and SageMaker (ML & AI), Google Looker, and Azure Synapse is a must. Utilizing off-the-shelf or even customized algorithms to validate data becomes the norm in 2023.
1. Machine Learning and AI. Heard of ChatGPT, anyone? Yes, it is the year of AI. Knowledge of terminology and familiarity with algorithms remain an important part of the Data Engineer’s skillset. At minimum, familiarity with Python’s libraries NumPy, SciPy, pandas, and scikit-learn, plus some actual experience with notebooks (Jupyter or online cloud), is vital. This is taken to the next level in cloud-based tools like AWS SageMaker, Microsoft’s HDInsight or Synapse, or Google’s Datalab toolsets. This field’s toolsets are getting more complex every year.
Data Engineers must avoid the five common mistakes, which include overly complex data, inaccurate data, not clarifying usage requirements, and not communicating issues.[14] Trying to gain knowledge on your own, without proper guidance and insight, generally takes a long time. A proper certified training program that plans out your schedule, is adaptable, uses real-world labs, and allows you to study with an experienced instructor is key to your success.
[1] https://www2.deloitte.com/us/en/insights/industry/technology/why-organizations-are-moving-to-the-cloud.html
[2] https://www.citrix.com/solutions/app-delivery-and-security/what-is-multi-cloud.html
[3] https://www.gartner.com/en/newsroom/press-releases/2021-11-10-gartner-says-cloud-will-be-the-centerpiece-of-new-digital-experiences
[4] https://www.analyticsvidhya.com/blog/2021/12/a-review-of-2021-and-trends-in-2022-a-technical-overview-of-the-data-industry/
[5] https://www.techrepublic.com/article/the-best-programming-languages-to-learn-in-2022/
[6] https://www.ibm.com/cloud/blog/top-7-most-common-uses-of-cloud-computing
[7] https://jdp491bprdv1ar3uk2puw37i-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/102519_Ultimate_Guide_To_Data_Ops_Tamr.pdf
[8] https://www.oss-group.co.nz/blog/data-governance-key-elements-to-consider
[9] https://hostingdata.co.uk/nosql-database/
[10] https://www.gartner.com/en/information-technology/insights/top-technology-trends
[11] https://www.computerweekly.com/news/252505227/Multicloud-adoption-on-the-rise
[12] https://www.gartner.com/en/newsroom/press-releases/2021-08-02-gartner-says-four-trends-are-shaping-the-future-of-public-cloud
[13] https://www.bigcommerce.com/blog/saas-vs-paas-vs-iaas/#the-key-differences-between-on-premise-saas-paas-iaas