Think back on previous projects that involved a team effort. Remember the ones that failed to meet deadlines or blew past the budget. What was the common factor? Was it insufficient hyperparameter tuning or poor model artifact logging? Probably not, right? One of the most prevalent reasons projects fail is bad project management. Effective project management means breaking a project down into manageable phases and continuously estimating the remaining work in each phase.
Project management encompasses various responsibilities, including sprint execution and retrospectives. However, I want to shift the focus from project management as a role to project management as a skill. Just as anyone in a team can display leadership skills, anyone can also exhibit project management skills. And let me tell you, this is an incredibly valuable skill for a data scientist.
Let’s zoom in on estimating a single phase to illustrate the challenge. The truth is, estimating the duration of data science work can be very difficult:
- How long will it take to clean the data? It entirely depends on the data you are working with.
- How much time is needed for exploratory data analysis? It completely depends on the discoveries made along the way.
You see my point. This complexity has led many to believe that estimating the duration of phases in a data science project is futile.
However, I believe this is the wrong conclusion to draw. It’s more accurate to say that the duration of a data science phase is hard to estimate before the phase has started. But here’s where project management comes into play: working with continuous estimation. Or, at least, that’s what effective project management is all about 😁
Imagine you’re not estimating a data cleaning job in advance. Instead, you’re one week into the task of cleaning the data. At this point, you know that there are three data sources stored in different databases. Two of the databases lack proper documentation, while the last one lacks data models but has comprehensive documentation. Some data is missing in all three sources, but fortunately, not as much as you initially anticipated. What can you infer from this?
Certainly, you’re not clueless. You know that you won’t complete the data cleaning job tomorrow. On the other hand, you’re confident that three months is an excessive timeframe for this job. In other words, you have a rough probability distribution over when the phase will be finished. This distribution has a “mean” (the estimated duration of the phase) and a “standard deviation” (the uncertainty in that estimate).
The crucial point is that this conceptual distribution evolves every day. As you gather more information about the work that needs to be done, the distribution changes. Naturally, the “standard deviation” diminishes over time as you become more certain about when the phase will be completed. It’s your responsibility to convey this to stakeholders, though you can skip the technical distribution language when you do. That can be our little secret.
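To make the metaphor concrete, here is a minimal Python sketch, with all numbers invented for illustration, that treats each week’s estimate as a normal distribution and turns it into the kind of range a stakeholder can act on:

```python
from statistics import NormalDist

# A toy sketch of continuous estimation. All numbers are invented for
# illustration: each entry is our belief about the total phase duration
# (in weeks), re-estimated at the end of each week as we learn more.
weekly_beliefs = [
    (1, 5.0, 2.0),  # week 1: three poorly documented sources, wide spread
    (2, 4.5, 1.2),  # week 2: documentation reviewed, spread narrows
    (3, 4.0, 0.6),  # week 3: missing data quantified, narrower still
]

for week, mean, std in weekly_beliefs:
    belief = NormalDist(mu=mean, sigma=std)
    # An 80% interval: roughly the range you would quote to stakeholders.
    low, high = belief.inv_cdf(0.1), belief.inv_cdf(0.9)
    print(f"Week {week}: likely done in {low:.1f} to {high:.1f} weeks total")
```

The exact distribution family is beside the point; what matters is that both the center and the spread of your estimate are things you can recompute, and re-communicate, every single week.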
Having a data scientist who can articulate something like this is immensely valuable:
“I anticipate this phase will require between 3 and 6 weeks. I will provide you with an updated estimate in a week that will be more accurate.”
In summary, effective project management in data science involves continuous estimation and adapting as new information emerges. By leveraging this skill, a data scientist can provide stakeholders with valuable insights into the project’s progress and expected timelines.