From MSc thesis to Lead Data Scientist

4 years of mistakes, learnings, challenges and successes

Finally, after some months of reflection, it seems that I have managed to organize my thoughts about writing online. I was encouraged by a friend (who nailed some of his research work before finishing his PhD) and inspired by other AI tech writers such as Chip Huyen, Charity Majors or my former colleague Sofian Hamiti. I believe it is about time to put my two cents into the community, as becoming a lead data scientist is not something that you usually achieve entirely on your own.

What will you find below?

Today my plan is to give you a 360 view of how I became a lead data scientist. At least, that’s what people inside and outside my company started to call me after my 3rd year at Siemens Energy. You might find this blog a bit redundant, as rivers of digital ink have already been published on the Internet in similar posts. There are even Reddit discussions, so your idea of this post could be similar to this:

If that is the case for you, feel free to leave. However, if you are still curious about how someone can go from an MSc thesis to a technical leading position in such a short time, it might be worth staying. My main goal is to give an insider’s perspective and to show that there is nothing magic about it. Besides, it could help you to know that this is not only about hard work, long self-study hours, and some TED-talk guru advice. Most of the success I had was due to a sequence of opportunities. Sometimes they came from detailed planning, sometimes we were just lucky. In short, the team I was working with managed to squeeze 10000% out of what these opportunities could offer. However, for some months it felt like people were looking at us as the fellowship of the ring in the last film:

Gimli Success GIF - Gimli Success Lord Of The Rings GIFs

Fortunately, we never stopped believing in ourselves and we always knew that despite having challenging tasks, all of them were always manageable. We were just shooting to Mars, knowing that our objective was the Moon. On top, I had the extreme luck of having the greatest mentor you could ask for in your first industry job.

To narrate my journey, I could not find a better analogy than the Hype Curve. Without further introduction let’s start from the beginning.

The Hype Curve: a chart popularized by Gartner (image from Wikipedia)

The beginning (technology trigger)

I started my journey at my previous company in February 2018. I was full of joy because I had finally managed to get thesis work outside my study area (Energy Systems). Since 2017 I had known that the traditional energy engineering job was something that I was not going to enjoy. Who would have said, after studying the same laws more than 10 times and being forced to use Excel for everything ;). Throughout my MSc studies I found the Coursera Data Science Specialization much more interesting than most of my subjects. The world of what they called “data science” was revealed to me and I found it astonishing.

My first week was great and the onboarding was fairly smooth. My thesis supervisor at the company made one of these whiteboards that you expect to see in an HR campaign of a company trying to get new talent. The possibilities were endless for the project I could develop. In fact, in the next years this project became one of the most important AI-driven products of my division after the successful contributions of many clever people [1][2][3].

Time flew, and after 6 months I finished my thesis and got an extension of my contract to work on diverse topics. I didn’t do ML for a while, but I got exposed to some basic cloud technologies (EC2, S3 and Redshift). The previous data scientist working there, as well as a data engineer colleague, gave me very good support. I built some ETL scripts and started to become a bit better at coding (using loggers, parametrizing scripts, writing CLIs, connecting to DBs etc…). It was a bit stressful not to have many people to ask, especially after both of the data scientists in my team left, but I managed to embrace this. All of this tech world was amazing and I was able to do things I couldn’t have imagined one year before.

Besides, I had some “fun” managing dependencies in EC2. Lucky me, the big whale saved me some months later.

Life as data scientist (peak of inflated expectations)

Around March-May 2019 I had my first customer-facing meeting and my first business trip. I was a bit nervous because I did not know what was expected of me. Also, it was the first time I went on a project with people I had never met before. We spent 1 week in total and 2 days at the customer site. The customer’s head of analytics was probably one of the smartest guys I have ever met (the guy had 30+ patents and a PhD) and, despite being an engineer, he was training himself with an MSc in Data Science. Moreover, the customer had its own DS team and they were setting up their own cloud environment. The level of maturity they had was something that I did not expect at the time. My colleagues, who had been in other customer-facing meetings, had the same impression, as our customers were usually at the beginning of their ML journey.

I will always remember my noisy laptop burning out in the hotel room (at that time we didn’t have SEER) while the models were being trained. The week was very stressful; I drank a lot of coffee and slept very little. Nevertheless, after seeing the front-end PoC that a colleague pulled off in 2 days, as well as the cloud infra and the predictions uploaded to the DB by another colleague, I was very relieved.

As you could expect, we managed to succeed. We developed a prototype in 1 week to detect anomalies in the vibration signals on the customer equipment using a combination of unsupervised and supervised learning. The customer was happy, the team was happy, everyone was happy!

After that, my permanent contract approval, which had been pending (or blocked, you never know), was sorted out and I got permanently hired!

This event happened at the same time that some students started in my team. I had the task of supervising their MSc thesis projects in collaboration with my mentor (who was my previous thesis supervisor). I realized by that time how important it was to take care of new people. If I had had a different mentor when I started, there would have been no chance of getting that far. The fact that someone much more senior than you in the business you are working in spends time explaining to you the context, the vision of the company, how the company makes money, what the current burning points are, what the potential use cases for your skills are, etc. is absolutely critical. If you think that a great mentor is the one who reviews your code from A to Z or tells you which model to use, I will have to disagree. I am not saying that this is not important (because it is), but it is just a bonus.

In any case, I tried to take care of these 3 students in the best way I could. Two of them got an A as a thesis grade, so I think that we managed well (sorry Dhruv if you are not in the pictures, but I did not manage to get yours).

Summer arrived and my mentor got the chance to present at our most important internal DS conference. He was kind enough to allow me to join him on stage. I realized that he and his projects were well known in the company, and I also found that people were doing cool things with AI outside my division. Thanks to this I was able to start expanding my network, get fresh ideas, and realize that we were quite advanced for our team size.

MLOps needs (valley of delusion)

Around November 2019 we started our first full scale machine learning project. After having consolidated 3 big data sources, we had sufficient data to predict some critical KPIs in order to plan the maintenance schedule of our gas turbines. The problem was that we had more than 4000 time series, and because of the characteristics of the data we did not manage to build a global model. In all of the benchmarks I performed, individual time series models were performing much better than a global one (disclaimer, now I am confident I will manage to outperform my previous experiments).

The solution seemed pretty obvious though. After watching a couple of online presentations, looking at the fable package and looking at what other companies were doing, 4000 time series didn’t look like that much. Adding the hyperparameter tuning jobs and the different algorithms, we were ending up with about 100k training jobs.
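To make the arithmetic concrete, here is a minimal sketch of how quickly the job count grows. The five algorithm names and the five hyperparameter configurations per algorithm are illustrative assumptions chosen only to land on the ~100k figure; they are not the actual grid from the project.

```python
from itertools import product

# Known from the text: roughly 4000 individual time series.
N_SERIES = 4000

# Hypothetical values, used only to illustrate the order of magnitude.
ALGORITHMS = ["ets", "arima", "theta", "gbm", "naive"]
CONFIGS_PER_ALGORITHM = 5


def count_training_jobs(n_series, algorithms, configs_per_algorithm):
    """Count one training job per (series, algorithm, hyperparameter config)."""
    grid = product(range(n_series), algorithms, range(configs_per_algorithm))
    return sum(1 for _ in grid)


total_jobs = count_training_jobs(N_SERIES, ALGORITHMS, CONFIGS_PER_ALGORITHM)
print(total_jobs)  # 4000 * 5 * 5 = 100000
```

Every job in such a grid is independent of the others, which is why elastic, on-demand compute fits this kind of workload better than a single long-running machine.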

Everything sounded very reasonable until I said that I needed X amount of compute to run this experiment. Of course, the classical EC2 instance we were using was not an option, as we needed elastic compute on demand. By that time I was also absorbed in the MLOps trend and I was familiar with some tooling to build strong MLOps foundations (MLflow, the new Metaflow, Kubeflow, TFX etc…). I had a good picture of what we needed in order to log all of these experiments, orchestrate the workflows, and persist our models. However, after stating my needs I received something very different from what I needed.

I remember this time as a very painful one. A colleague of mine and I had to spend around 2-3 months testing something that I had known since day one had absolutely 0 chance of working. I knew it so well that I was astonished that people were not seeing it. It was as if you were talking to people about the colour of the clear sky and they were telling you that it was yellow instead of blue. Fortunately, while we were wasting our time we were also running plans B and C thanks to the support of our data platform team and AWS.

Unless you have a dedicated MLOps platform team, or multi-cloud is a must and you need to stay open source, my suggestion would be to use a managed solution. If you really need open source, please do not reinvent the wheel. Kubeflow, Flyte, Metaflow etc… are there for you.

We tried to run a couple of cases in the tooling that we were requested to use and well… absolutely nothing worked (surprise!). We ended up with a combination of AKS (Kubernetes), RStudio Teams (for launching VSCode, RStudio and JupyterLab in Kubernetes sessions), RStudio Connect (for lightweight dashboards) and SageMaker + Step Functions for the production code (spoiler: it worked like a charm!).

A lot of political fights and business shenanigans came during and after this period but finally we were able to work more or less properly.

Becoming lead data scientist (the slope of enlightenment)

By Q2 of 2020 we more or less had our data science platform (a.k.a. SEER). We started to collaborate more and more with AWS to identify the knowledge gaps in the team and how to organize ourselves better. Finally, our non-technical work environment (managers and project managers especially) understood what an MLOps Engineer was and why most data scientists could not get full expertise in building models, deploying containers, making serverless workflows etc.

However there was no one in the team who could bridge the gap at the time and I accidentally became this person. Yes, you got it! I became a unicorn data scientist!

I spent most Sundays studying AWS infrastructure as well as reading thousands of articles about MLOps. Time during working hours was getting scarcer and scarcer, and our PM at the time was not ramping up his technical knowledge despite the increasing complexity of our projects. I had to take over most of the scrum process and I spent most of my time assigning tasks to people. Little by little, my coding time became very limited, but fortunately the team was so great that they needed minimal instructions to get the job done. Some quick draw.io charts, some code skeletons and some Paint drafts were more than enough to organize a 1-2 week sprint.

By summer 2020 I was spending around 20% of my time making code examples, 60% organizing the teams in different projects and another 20% making slide decks about what we were doing and the technical roadmap for our ML platform in the next quarter. A couple of months earlier, people had started to call me “Lead Data Scientist” in internal e-mails, and it started to make sense in my head. I became one of the go-to people for Machine Learning technical topics in our division (apart from my mentor, who was the analytics topic lead in our division). I even made it onto the company intranet once!

The efforts of my team were recognized by different partners and we got the opportunity to present at different conferences. We got the offer to present a new feature of AWS SageMaker at re:Invent 2020. However, the feature did not make it in time and we ended up presenting our first ML at scale use case.

Rolling out projects (plateau of productivity)

After very hard work we managed to have our first version of MLOps tooling around 2021 Q1. We were writing some light unit tests, testing the individual components of our workflow, and we even wrote our own custom integration test tooling. This amazing tool was able to test the workflows end to end, bypassing the job timeout limitations in GitLab. Unfortunately, the mechanism of this tool is covered by my not yet expired NDA, so I cannot reveal that much. But in short, you could think of it as some sort of predecessor (in terms of functionality) of what you can do now with SageMaker Pipelines or with Kubeflow Pipelines.

We made tremendous progress during the summer and we even implemented an active model monitoring system based on model performance drift (that is, comparing the previous predictions with the actuals when the actuals arrived). Our implementation was an earlier version of what today is called the “Feedback endpoint” implemented by Gantry or Seldon Core. I was so proud of this, since most of the monitoring capabilities of ML systems were centered around data drift and latency metrics. We did not manage to implement incremental training and checkpointing, but still, I think it was a tremendous success:

Our beautiful error datamart built from our model metadata and model metrics (this later became a production dashboard, but at the time ggplot2 was sufficient). The dashboard showed how retraining was triggered every time the error went above a certain level.
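The idea behind performance-drift monitoring can be sketched in a few lines. This is a minimal illustration and not the system described above: the choice of mean absolute error, the function names and the threshold value are all assumptions.

```python
from statistics import mean


def mean_absolute_error(predictions, actuals):
    """Average absolute gap between past predictions and the actuals that arrived later."""
    if len(predictions) != len(actuals) or not predictions:
        raise ValueError("predictions and actuals must be non-empty and of equal length")
    return mean(abs(p - a) for p, a in zip(predictions, actuals))


def needs_retraining(predictions, actuals, threshold):
    """Flag a model for retraining once its error rises above `threshold` (hypothetical rule)."""
    return mean_absolute_error(predictions, actuals) > threshold


# Hypothetical numbers: the model was accurate last week and drifted this week.
last_week = needs_retraining([10.0, 12.0, 11.0], [10.0, 13.0, 11.0], threshold=1.0)
this_week = needs_retraining([10.0, 12.0, 11.0], [14.0, 16.0, 15.0], threshold=1.0)
print(last_week, this_week)  # False True
```

The difference from data-drift monitoring is that this check needs the actuals, so it can only run with a delay, once the ground truth has arrived.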

We reached a minimum in technical debt. Despite still having plenty of work to do, it was fairly easy to make code changes and deploy them. We also developed very good observability capabilities and we were always able to identify when a production pipeline was failing and in which step. The nightmare of having to deploy code manually by connecting to an EC2 instance in a VPC was gone and, despite some small legacy pipelines, everything was looking bright. Most of the MLOps tooling we built was reusable and we were implementing it step by step in other projects.

At the beginning of 2021, 3 new MSc thesis students joined and they were running on-demand training jobs remotely in just 3-4 months! This would have been impossible two years earlier. All the hours burning my local computer or running hard-to-debug nohup Python commands in a 24/7 EC2 instance were now in the past. We only needed people willing to learn, and they could build whatever they wanted.

How I would have felt in 2019 with the right ML platform

In addition, around March-April I got an additional position at company-wide level, and I felt quite rewarded by the recognition and trust of some of our global management team. Unfortunately, I was already burnt out. I felt I was a technical lead, but I struggled with the little time I had to develop my technical skills. Some of my thoughts are very well summarized in this blog post by Charity Majors: “Questionable Advice, the trap of the premature senior”.

The worst part was having to explain certain things again and again to the same people. I was astonished at how some people were either so lazy or so unprepared for the positions they were holding. The second one is acceptable, the first one is very hard to swallow. Our team did pretty much the impossible and beyond to deliver on our projects, we were learning constantly and we were pushing the boundaries of our cloud provider day by day. We also went far beyond our job scope as data scientists or software developers. Still, there were some people who didn’t even bother to learn the basics. Unfortunately, my work ethic did not allow me to stay in a place maintaining this status quo, and that’s why I decided to leave.

Now, after a bit more than one year, I can look back and remember more good things than bad things. I still miss some of the technology I built there, but I have to confess that my current team has managed to get pretty close to what I did in the past. In fact, in some areas I feel I have a better setup now. Besides, taking into account that it only took one year to reach the same status, it feels to me that my current team is somehow magical.

My path at Siemens Energy ended with me handing over the torch to one of the students I had supervised in the past, and this made me really happy. I knew he was going to have a tremendous amount of work after I left, but I was confident that he was going to be able to make it. We presented at the Data Innovation Summit 2021 and a new era for me started at B2Holding ASA.

The lessons learnt

After all this text, you probably got more or less the idea of how I became lead data scientist but this post was about the lessons learnt and I think I can summarize them as follows:

  1. Try to get as much exposure to technology as possible at the beginning of your career. The better you understand your technological context, the better you will become at collaborating with people who are not data scientists but have other essential technical roles.
  2. Try to work on projects that do not involve pure ML model building. It is a good way to get your hands dirty outside your knowledge area. The time I worked with ETL stuff, logging, interaction with object storage etc… has paid off.
  3. Find a mentor, and a good one (this one is the most difficult one but maybe luck is on your side).
  4. Try to expand your network, via internal conferences, the official internal communication channels of your company, or externally by presenting at or attending conferences. This will give you an overview of what is going on in the sector. In addition, it is a great source of inspiration for new projects.
  5. MLOps and cloud computing are important. If you think that in 2023 you can survive with Jupyter notebooks and just making your models the most accurate ones… well… best of luck!
  6. When I started to write this post it was 2022, so to add to the previous point (MLOps), I think it is important to consider that ML is really going real time (whether you like it or not). Especially in the last six months I have started to see how multiple cloud providers are making this easier and easier day by day. And then we have Max Halford with River, Debezium and other open source tooling.
  7. Learn software engineering. I’m not saying you need to write unit tests all the time as a data scientist, but it is useful if you produce proper code and you know the basics of testing software. Also, writing unit tests yourself instead of delegating this to MLOps Engineers is a good way of checking whether your functions are too complex. Finally, you put yourself in the shoes of others and this will improve your cooperation with your peers (check point 1).
  8. Write your docstrings or you will find me at your doorstep with a sharp knife (it wasn’t mentioned above but more posts will follow about this).
  9. Do not overcomplicate your algorithms unless you need to do so. Of course, in areas like NLP and CV neglecting the performance of CNNs, Transformers etc… would be foolish. However, you need to bear in mind that if you are working with tabular data, most of the non-DS people dealing with these data do not go beyond linear regression. Yes, MQN-HiTS (multi quantile neural hierarchical interpolation for time series) looks promising, but good luck explaining this to your stakeholders (also good luck beating trees for time series forecasting).
  10. Create outcome, not output. Skill level is rarely measured by output. Output is what we produce as individual contributors; outcome, on the other hand, is the impact of the stuff we produce. I encourage you to read this blog post about Staff Engineer archetypes. When I started SEER I was acting as Architect; when I created the first E2E ML project I was acting as Tech Lead and Solver. Finding your archetypes is the basis for understanding which outcome you can create.
  11. If it is cheap and it doesn’t affect project deadlines, break as many things as you like. Sometimes it is useful to learn by iteration.
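As a small illustration of points 7 and 8, here is what a documented helper with a couple of unit tests might look like. The `lagged` function is a hypothetical example invented for this sketch, not code from any of the projects above.

```python
def lagged(values, lag):
    """Return `values` shifted forward by `lag` steps, padding the start with None.

    Args:
        values: sequence of observations, oldest first.
        lag: number of steps to shift by; must be non-negative.

    Returns:
        A new list with the same length as `values`.

    Raises:
        ValueError: if `lag` is negative.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    values = list(values)
    padding = [None] * min(lag, len(values))
    return padding + values[: max(len(values) - lag, 0)]


def test_lagged_shifts_values():
    assert lagged([1, 2, 3, 4], 1) == [None, 1, 2, 3]


def test_lagged_longer_than_series():
    assert lagged([1, 2], 5) == [None, None]
```

If writing the tests for a function like this feels painful, that is usually a hint that the function is doing too much.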

Thank you for reading this post and feel free to follow me on LinkedIn to know more about me!

Building a poor man’s data lake: Exploring the power of Polars and Delta Lake

Or why you probably do not need databases in the way you think

Introduction

Embarking on the journey of data engineering often feels like navigating a landscape where innovation and repetition go hand in hand. The cyclic introduction of new query engines and MPP databases, each attempting to redefine the familiar staging, historical, and data warehouse layers, can, at times, border on mundane. It’s an environment that rarely sparks excitement for me as I find data science a much more interesting realm. However, the data lake concept was kind of a revelation.

Early in my career, I found myself intrigued by the concept of separating storage from compute in the world of OLAP and analytics. This fascination was fueled by my positive encounter with Snowflake, a platform that elegantly embraced this separation. It led me to ponder the evolution from traditional Data Warehouse (DWH) systems to what felt like a more natural progression—the data lake or the lake house architecture.

In parallel to the rise of micro-service architectures, the notion of breaking down complex structures into more manageable “small microservices” seemed not only logical, but also promising for the field of data engineering. Reflecting on my past experiences with on-premise MSSQL servers managed by IT people rather than expert data engineers (servers which lacked proper version control and integration with identity services, and whose administration had a general air of disorder), I couldn’t help but yearn for a more efficient and cost-effective solution.

Also, the practice of employing OLTP systems for executing intricate joins and on-the-fly aggregations struck me as a somewhat misguided use of SQL Server. Furthermore, there was a prevailing belief among some individuals that SQL Server served as a viable option for housing an exorbitant volume of time series data—billions upon billions of records. Setting aside the questionable wisdom of such a notion, I can only imagine that the cloud bill for this SQL Server storage was definitely not appealing to the eyes of rational people. It becomes evident that, at the beginning of my career, my sentiment toward databases was far from favorable.

But what is a data lake and why do we need one?

As I delve into the realm of data engineering, let’s explore the foundations of this field and unravel the advantages that data lakes, particularly, leveraging technologies like Polars and Delta Lake, hold over their traditional data warehouse counterparts. The landscape may be ever-changing, but the essence lies in adapting, evolving, and embracing innovations that promise efficiency and progress in the world of data.

As organizations seek more flexible, scalable, and cost-effective solutions, data lakes emerged as a compelling alternative. A data lake stands as a centralized repository empowering organizations to store their structured and unstructured data at any conceivable scale. This encompasses raw data originating from diverse sources such as PDFs, IoT devices, and business applications. The distinctive feature that sets data lakes apart from their traditional warehouse counterparts is the flexibility to house data in its unaltered state.

However, this flexibility does come at a cost. Ultimately, the pursuit is for high-quality, validated data, and the imposition of schemas through SQL DDLs or data contracts becomes imperative to derive meaningful insights. One of the defining strengths of data lakes lies in their exceptionally low storage costs compared to traditional data warehouse solutions. To illustrate this, consider the Azure SQL Server storage costs:

In comparison, ADLS Gen2 costs:

Here, the hot storage—likely swift enough to cater to the majority of your analytical workloads—comes at a fraction of the cost, approximately ten times less. Moreover, cloud storage services offer the flexibility to enact life cycle policies, enabling the seamless movement of data across storage tiers based on both the creation date and access frequency. An illustration of such a storage policy is provided below:

{
  "rules": [
    {
      "enabled": true,
      "name": "sample-rule",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "version": {
            "delete": {
              "daysAfterCreationGreaterThan": 90
            }
          },
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90,
              "daysAfterLastTierChangeGreaterThan": 7
            },
            "delete": {
              "daysAfterModificationGreaterThan": 2555
            }
          }
        },
        "filters": {
          "blobTypes": [
            "blockBlob"
          ],
          "prefixMatch": [
            "sample-container/*.avro",
            "sample-container/*.parquet"
          ]
        }
      }
    }
  ]
}

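This can help organizations dramatically cut their data storage costs over time. Basically, the storage policy will transition all files with *.parquet or *.avro formats to different storage tiers depending on rules that we can decide upfront. One caveat: as far as I know, prefixMatch matches blob-name prefixes rather than glob patterns, so it is worth checking the Azure documentation before relying on the wildcard entries shown above.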
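To see what such a policy means for a single file, the thresholds can be replayed in plain Python. This is only a sketch of the rule logic: it is not how Azure evaluates policies, the function name is made up, and it ignores the `daysAfterLastTierChangeGreaterThan` condition and the version rule.

```python
from datetime import date

# Thresholds copied from the sample policy (days since last modification).
TIER_TO_COOL_DAYS = 30
TIER_TO_ARCHIVE_DAYS = 90
DELETE_DAYS = 2555  # roughly seven years


def expected_tier(last_modified, today):
    """Return the tier a blob would be expected to end up in (simplified, hypothetical helper)."""
    age_in_days = (today - last_modified).days
    if age_in_days > DELETE_DAYS:
        return "deleted"
    if age_in_days > TIER_TO_ARCHIVE_DAYS:
        return "archive"
    if age_in_days > TIER_TO_COOL_DAYS:
        return "cool"
    return "hot"


written = date(2024, 1, 1)
print(expected_tier(written, date(2024, 1, 15)))  # hot (14 days old)
print(expected_tier(written, date(2024, 2, 1)))   # cool (31 days old)
print(expected_tier(written, date(2024, 4, 1)))   # archive (91 days old)
```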

Big data formats for data lakes

Traditionally, schema-rich formats (.parquet and .avro) have been used to store structured data in the data lake. Parquet is a columnar format, while Avro is row-based. These formats allow us to validate or cast our data types to the desired schema, in the same way that table DDL scripts block data that does not fulfill some basic quality rules from being ingested into tables.

Parquet Example

import polars as pl

# Create a sample Polars DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}

df = pl.DataFrame(data)

# Define the desired schema as a mapping of column name -> Polars data type
desired_schema = {
    'name': pl.Utf8,
    'age': pl.Int64,
    'city': pl.Utf8,
}

# Cast the columns to the desired types (raises if a value cannot be cast)
df = df.cast(desired_schema)

# Write the Polars DataFrame to a Parquet file; the schema is stored in the file
file_path = 'example.parquet'
df.write_parquet(file_path)

Avro Example

import fastavro
import polars as pl

# Create a sample Polars DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pl.DataFrame(data)

# Convert the Polars DataFrame to a list of dictionaries (one per row)
records = df.to_dicts()

# Define the Avro schema matching the DataFrame columns
# ('long' is Avro's 64-bit integer, matching the Int64 column in Polars)
schema = {
    'type': 'record',
    'name': 'example',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'long'},
        {'name': 'city', 'type': 'string'}
    ]
}

# Serialize the records to an Avro file
with open('output.avro', 'wb') as avro_file:
    fastavro.writer(avro_file, schema, records)

One might be tempted to think that these formats come with a catch, given the overhead involved when writing them to storage, particularly when applying heavy compression. However, the efficiency of the writers, coupled with engines that facilitate parallelized writing, mitigates this concern to a considerable extent. In addition, since query engines do not need to infer the schema, reads are significantly faster than with other formats.

Nevertheless, challenges arise when the need to modify these files appears, or more precisely, when overwriting becomes a necessity. To illustrate this point, consider a Parquet table with the following structure:

/
└── 📁 mydata
    ├── 📁 year=2022
    │   ├── 📁 month=11
    │   │   ├── 📄 file1
    │   │   └── 📄 file2
    │   └── 📁 month=12
    │       ├── 📄 file1
    │       └── 📄 file2
    └── 📁 year=2023
        └── 📁 month=1
            ├── 📄 file1
            └── 📄 file2

If we want to modify the schema of the Parquet table, we need to overwrite all of the partitions. This can be problematic, especially if the number of files is large, as listing these files is bottlenecked by the performance of the storage account. This is commonly known as the file listing problem:

Extracted from Delta Lake blog
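A local sketch makes the listing cost easier to picture. The code below recreates the partition layout shown earlier in a temporary directory (the helper names are hypothetical) and then discovers the table the only way a plain Parquet reader can: by walking every partition folder.

```python
import tempfile
from pathlib import Path


def build_partitioned_table(root, partitions, files_per_partition):
    """Create empty placeholder files in a Hive-style year=/month= layout."""
    for year, month in partitions:
        partition_dir = root / f"year={year}" / f"month={month}"
        partition_dir.mkdir(parents=True)
        for i in range(1, files_per_partition + 1):
            (partition_dir / f"file{i}.parquet").touch()


def list_data_files(root):
    """Discover the table by recursively listing every partition directory."""
    return sorted(root.rglob("*.parquet"))


root = Path(tempfile.mkdtemp()) / "mydata"
build_partitioned_table(root, [(2022, 11), (2022, 12), (2023, 1)], files_per_partition=2)
data_files = list_data_files(root)
print(len(data_files))  # 3 partitions * 2 files = 6
```

On a local disk this is instant, but on object storage every directory level turns into paginated list requests, so the cost grows with the number of partitions and files. Transactional formats avoid this by recording the file list in a log.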


Moreover, when Parquet files are overwritten, all prior data is lost. Unlike transactional databases, there is no mechanism for rolling back the transaction. If an error occurs during the file writing process, you run the risk of corrupting your table. The Delta Lake vs. Parquet blog post is a great resource to learn about the limitations of Parquet.

Transactional data lake formats

For a conventional, old-fashioned Data Warehouse (DWH) engineer, the absence of ACID properties in flat files can be akin to a nightmare. After all, databases were originally conceived to address specific needs. Despite the allure of low storage costs, losing the features that make databases invaluable raises concerns. In essence, our ideal file format should possess the ability to:

  • Support DML operations, that is: INSERTs, DELETEs, UPSERTs and SELECTs.
  • Support schema evolution without having to rewrite the whole table (in the same way you can add a column to a DB table by changing its DDL definition).
  • Guarantee consistency, among the other ACID properties.
  • Support partitioning, clustering and indexes.
  • Support SCDs (slowly changing dimensions) in an easy way.

Transactional data lakes were born to overcome these limitations. Currently, the more prominent ones are:

  • Apache Iceberg: Iceberg focuses on providing ACID transactions for large-scale analytical data sets. It introduces features like atomic commits, schema evolution, and consistent snapshots. The table format is designed to be portable across different storage systems, allowing users to switch between systems without rewriting data. Apache Iceberg also enables partition evolution (unlike Delta or Hudi).
  • Delta Lake: Delta Lake, from the creators of Spark, addresses the challenges associated with data lake reliability. It offers ACID transactions, schema evolution, and time travel capabilities.
  • Apache Hudi: Hudi is tailored for incremental data processing and simplifying data management in Hadoop-based systems. It supports upserts and deletes at the record level, making it efficient for handling changing data.

Choosing among these formats depends on factors such as your existing technological infrastructure, team expertise, and the specific needs of your business. Each format comes with its own set of advantages, and the decision should align with your organization’s goals and preferences. In this context, I will delve deeper into Delta Lake, considering its prominent position in the landscape.

Why delta lake?

Conceptually, delta lake is an ecosystem. It is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.

Fundamentally, you can interact with Delta through two kinds of APIs:

  • Direct Java/Scala/Python APIs – The classes and methods documented in the API docs are considered stable public APIs. All other classes, interfaces and methods that may be directly accessible in code are considered internal, and they are subject to change across releases. It is worth mentioning that the Python API described in these docs needs a Spark session to execute Delta operations.
  • Spark-based APIs – You can read Delta tables through the DataFrameReader/Writer (i.e. spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).

When we take a look at the writer clients that Apache Hudi supports:

As well as Apache Iceberg:

We realize that, at present, there isn’t a straightforward way to write to these formats using Python alone, other than our trusty PySpark. I’m not passing judgment on whether this is positive or negative. I’m merely highlighting that it could potentially create an entry barrier. Beyond this consideration, it’s crucial to acknowledge that we’re constructing a poor man’s data lake. That means we have neither the budget nor the data size to justify spinning up a Spark cluster meant to process thousands of TB of data. In fact, Spark needs to be tuned carefully before it handles small files well.

Having said that, the Delta Lake landscape presents intriguing possibilities. Firstly, there’s the remarkable delta-rs package, an implementation of the Delta Lake protocol in Rust with Python bindings. This is what we call “python deltalake” (with no Spark dependencies). Additionally, capturing the spotlight is the emerging star of the show: Polars.

Polars is able to write directly to delta.

The Python bindings of delta-rs offer more functionality than Polars (but merges/upserts should be supported in the next release, as my loud complaint/Xmas wish was granted).

Hi Ion 🙂

Why Polars?

As mentioned above, Polars is the new star of the show. Choosing between Polars and Spark for data processing workloads depends on various factors, each with its own set of considerations. Polars stands out as a compelling alternative, particularly for certain use cases.

Polars, built in Rust and featuring Python bindings, offers a more lightweight and memory-efficient solution compared to the more resource-intensive Apache Spark. This becomes especially advantageous when working with moderately sized datasets on machines with limited resources. The performance of Polars is noteworthy, with the ability to handle complex operations efficiently, making it a robust choice for analytical workloads.

One of Polars’ key strengths lies in its native support for parallelization, allowing for faster execution of operations across multiple cores. This parallel processing capability contributes to impressive performance metrics, enabling users to process data at scale with efficiency.

Moreover, Polars’ seamless integration with Pandas, a popular data manipulation library in Python, enhances its appeal. The familiarity of Pandas syntax makes it easier for data scientists and analysts to transition to Polars, streamlining the learning curve.

While Spark is a robust and feature-rich framework suitable for large-scale distributed computing, Polars excels in scenarios where agility, ease of use, and efficient memory utilization are paramount. For tasks involving mid-sized datasets and resource-constrained environments, Polars emerges as a nimble and powerful contender, offering a refreshing alternative to the heavyweight capabilities of Spark.

Finally it is important to mention that polar bears can quack.

A poor man’s data lake

To build a poor man’s data lake we will need the following:

  • A storage service (S3, MinIO, ADLSGen2 etc…)
  • A layered setup based on the lakehouse architecture (raw, bronze, silver, gold)
  • Python magic with polars and delta lake.

The storage service is up to the reader to set up. For the remaining parts, I have made a minimal demo repo that you can find here.

The pipeline performs the following actions:

  1. It reads ZIP files from a set of URLs.
  2. It decompresses them and reads the .csv files inside.
  3. It saves them locally as .parquet files in the file system of the compute engine you are using to run your code.
  4. It uploads them to a landing zone in ADLSGen2.
  5. It reads from the landing zone and then it converts the parquet files to delta and sends them to the bronze layer (append only).
  6. It reads from the bronze layer and performs a merge (without deduplication) against the silver table.
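  7. Besides, it shows how to perform Z-orders (you do not need to run a Z-order every time you append a file, but this was just for testing).

```python
import time

# ETLPipeline, DOWNLOAD_URIS and logger_normal are defined in the demo repo.

if __name__ == "__main__":
    t = time.time()
    etl_workflow = ETLPipeline()
    for uri in DOWNLOAD_URIS:
        try:
            etl_workflow.upload_to_landing(uri=uri)  # source file -> landing zone
            etl_workflow.raw_to_bronze()             # landing zone -> bronze (append only)
            etl_workflow.bronze_to_silver()          # bronze -> silver (merge)
        except Exception as e:
            # Log the failure and carry on with the next file instead of aborting the run.
            logger_normal.error(e)
            logger_normal.info("Continuing for next file")
    logger_normal.info(f"Total runtime was {time.time() - t} seconds")
```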
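Steps 1 and 2 of the list above can be sketched with the standard library alone. This is not the code from the demo repo: the helper name `read_csvs_from_zip`, the file names and the in-memory archive are assumptions made for illustration, and the download and Parquet conversion steps are left out on purpose.

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_bytes: bytes) -> dict[str, list[dict[str, str]]]:
    """Return the rows of every .csv member in a ZIP archive, keyed by member name."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore anything that is not a CSV file
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables


# Build a tiny archive in memory so the sketch runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("trips.csv", "id,value\n1,a\n2,b\n")
    archive.writestr("readme.txt", "not a csv")

tables = read_csvs_from_zip(buffer.getvalue())
print(tables)
```

In a real run the bytes would come from the HTTP response for each URL instead of the in-memory buffer.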

Results

If you are checking the code, you will probably notice some references to the following environment variables:

```python
import os

LANDING_ZONE_PATH = os.getenv('LANDING_ZONE_PATH')
ACCOUNT_NAME = os.getenv('STORAGE_ACCOUNT_NAME')
BRONZE_CONTAINER = os.getenv('APPEND_LAYER')
SILVER_CONTAINER = os.getenv('HISTORICAL_PATH')
GOLD_CONTAINER = os.getenv('DW_PATH')
```
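Since `os.getenv` returns `None` for a variable that is not set, it can help to fail fast before the pipeline starts. The sketch below is an assumption on my side and not part of the demo repo: `require_env` and `abfss_uri` are hypothetical helpers, the values are placeholders, and the `abfss://` layout is the usual ADLS Gen2 URI form, which may differ from how your own paths are built.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build a URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Placeholder values so the sketch runs; in practice they come from your shell or a .env file.
os.environ.setdefault("STORAGE_ACCOUNT_NAME", "mystorageaccount")
os.environ.setdefault("APPEND_LAYER", "bronze")

bronze_uri = abfss_uri(
    account=require_env("STORAGE_ACCOUNT_NAME"),
    container=require_env("APPEND_LAYER"),
    path="my_table",  # hypothetical table name
)
print(bronze_uri)
```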

These should have values that are meaningful to you, but in my case the storage account structure is as follows:

The landing zone contains all the original CSV files converted to Parquet format. These files are deleted after 1 day by the storage policies mentioned at the beginning of the post:

Then we have our append-only layer, which contains only the files in Delta format that matched the schema generated by the first file (as of today, it is not possible to merge schemas in Delta Lake without using PySpark):

Finally, the silver layer contains the incrementally merged data:

And in the _delta_log folder you can find the transaction log, which shows how the Delta table was modified by each operation (the excerpt below comes from an OPTIMIZE commit).

```json
{
    "remove": {
        "path": "part-00001-9ec91f4f-f438-41b...c000.snappy.parquet",
        "dataChange": false,
        "deletionTimestamp": 1701785668789,
        "partitionValues": {},
        "size": 80948938
    }
}
{
    "add": {
        "path": "part-00001-2bb8e5fb-65c3-4ca5-9d16...*.parquet",
        "partitionValues": {},
        "size": 57361922,
        "modificationTime": 1701785675935,
        "dataChange": false,
        "stats": "{\"numRecords\":3352527,\"minValues\".....}}",
        "tags": null,
        "deletionVector": null,
        "baseRowId": null,
        "defaultRowCommitVersion": null,
        "clusteringProvider": null
    }
}
{
    "commitInfo": {
        "timestamp": 1701785676006,
        "operation": "OPTIMIZE",
        "operationParameters": {
            "targetSize": "104857600"
        },
        "clientVersion": "delta-rs.0.17.0",
        "readVersion": 5,
        "operationMetrics": {
            "filesAdded": {
                "avg": 57361922.0,
                "max": 57361922,
                "min": 57361922,
                "totalFiles": 1,
                "totalSize": 57361922
.....
}
```
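Each commit file in the Delta log is newline-delimited JSON in which every line holds one action (add, remove, commitInfo and so on), so it can be inspected with nothing more than the `json` module. The sketch below is a minimal illustration: `summarize_commit` is a hypothetical helper, the sizes and the operation are taken from the excerpt above, and the file names are made-up placeholders because the real paths are truncated.

```python
import json
from collections import Counter

# Sample commit in the newline-delimited layout of a Delta log file.
# File names are placeholders; sizes and operation mirror the excerpt above.
SAMPLE_COMMIT = "\n".join([
    json.dumps({"remove": {"path": "old-file.parquet", "dataChange": False, "size": 80948938}}),
    json.dumps({"add": {"path": "new-file.parquet", "dataChange": False, "size": 57361922}}),
    json.dumps({"commitInfo": {"operation": "OPTIMIZE", "readVersion": 5}}),
])


def summarize_commit(commit_text: str) -> dict:
    """Count the actions in one commit and total the bytes added and removed."""
    actions = Counter()
    bytes_added = 0
    bytes_removed = 0
    operation = None
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        action = json.loads(line)
        for kind, body in action.items():
            actions[kind] += 1
            if kind == "add":
                bytes_added += body.get("size", 0)
            elif kind == "remove":
                bytes_removed += body.get("size", 0)
            elif kind == "commitInfo":
                operation = body.get("operation")
    return {
        "operation": operation,
        "actions": dict(actions),
        "bytes_added": bytes_added,
        "bytes_removed": bytes_removed,
    }


summary = summarize_commit(SAMPLE_COMMIT)
print(summary)
```

For the two files visible in the excerpt, this shows the OPTIMIZE swapping a file of roughly 81 MB for one of roughly 57 MB.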

It is probably worth mentioning that on a 16-core machine with 64 GB of RAM I am getting an average performance of 134k merge condition evaluations per second, which is around 8M rows per minute (134,000 × 60 ≈ 8.04M). I think that is astonishing.
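To make the idea of a merge condition concrete: for every incoming row, the engine checks whether a matching row already exists in the target, then updates it or inserts a new one. The sketch below is a plain-Python illustration of that upsert logic and not how delta-rs implements it; the `id` key, the sample rows and the `upsert` helper are assumptions made for the example.

```python
def upsert(target: list[dict], source: list[dict], key: str = "id") -> list[dict]:
    """Merge source rows into target: update rows whose key matches, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED THEN INSERT
    return list(merged.values())


silver = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
bronze_batch = [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}]

silver = upsert(silver, bronze_batch)
print(silver)
```

Each lookup of `row[key]` in `merged` plays the role of one merge condition evaluation.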

Final notes

It’s important to acknowledge that this example represents a minimal implementation. We haven’t integrated our code with an orchestrator, utilized any external compute engines, used all the features of Delta Lake (such as file compaction, vacuuming, etc.) or established a data catalog connection within the code. Despite these simplifications, my intention is that this post serves as a stepping stone, illustrating that working with transactional data lakes does not need to be overly complex.

If you find merit in this example and wish to contribute, please feel free to open a pull request in the example repository. Lastly, and perhaps most significantly, if you find these technologies intriguing, we’re currently expanding our team, so feel free to reach out to our head of analytics if you are interested in working with us.