The right flow to start a data project.


It makes a huge difference whether a business mission is approached with a tech vision or with a data vision. With a data vision, the goal is addressed by looking at the data one has on hand rather than by thinking first of hardware and software solutions. In this post I propose a method for starting a data project, underline considerations that increase its chances of success, and give examples we went through at BitPhy to make the point clearer.

Not always, but frequently, it all starts with a business need. At BitPhy our idea was to use new and disruptive technologies to make SMEs competitive against the “pezzonovante” of the retail sector. Looking for such new technologies, we got our hands on a pretty old one: the point of sale (POS). In particular, we thought about morphing the POS into an as-a-service machine.



A business idea can get into the realm of data if a consistent data source exists, and the POS has all the transactional data of a store: what is sold, when, and for how much.

For us, accessing the POS database was the trigger of our project.

When you first take a look at a new data source, a good list of questions has to be answered properly to evaluate the viability of the business or project (the following list is not necessarily sorted by importance):

  • How important is the data therein for the business?

This question cannot be measured directly. However, one can reformulate it as follows: What are the main KPIs of the business? And does the data offer any insight regarding those KPIs?

At BitPhy we plan to give value to grocery stores, whose main KPIs are billing, sales and stock measurements. That is precisely what one can find on the POS.
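As a rough illustration of checking whether a data source speaks to the KPIs, here is a minimal Python sketch that derives daily billing and units sold from POS-like transaction rows. The row layout, the product names and the function names `daily_billing` and `units_sold` are assumptions made up for this example; a real POS export will look different.

```python
from collections import defaultdict
from datetime import date

# Hypothetical POS export: (sale date, product, units sold, unit price).
# The layout and values are illustrative only, not a real POS schema.
transactions = [
    (date(2024, 1, 2), "apples", 3, 1.50),
    (date(2024, 1, 2), "bread", 1, 2.00),
    (date(2024, 1, 3), "apples", 2, 1.50),
]

def daily_billing(rows):
    """Sum units * unit price per day (the billing KPI)."""
    totals = defaultdict(float)
    for day, _product, units, price in rows:
        totals[day] += units * price
    return dict(totals)

def units_sold(rows):
    """Sum units per product (a basic sales KPI)."""
    totals = defaultdict(int)
    for _day, product, units, _price in rows:
        totals[product] += units
    return dict(totals)

billing = daily_billing(transactions)
sales = units_sold(transactions)
```

If aggregations this simple already reproduce the numbers the store owner watches, the data source is clearly relevant to the business.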

  • How unique is the dataset?

The rarer and more hidden the dataset is, the easier it is for you to strengthen your position if you get there first. Open data is a clear example of the opposite case. I remember that a while ago dozens of companies started exploiting open data APIs as those APIs flourished. Nowadays it is a saturated niche: only a few of those initiatives survived, and it is arduous to offer any new and original product in the sector. The value of data is inversely proportional to how widespread it is. This is an important consideration when forecasting the success of a data project.

In our case, we are accessing a de facto dark web. POS systems store their data in databases that cannot be indexed from the internet, so if you become the partner of the POS providers, the doors are almost certainly closed to any other player.

  • What is the size of the data source?

The more the better. This is what is nowadays known as big and small data. A normal grocery store is small data, with at most a few million rows, but some datasets get bigger, especially when ensemble algorithms are used. More data (clean data, actually) makes possible not only better algorithms but also more money: if your data is enough to keep track of a market, then you have struck gold.

Therefore, even though there are obvious limitations for those who build consulting solutions for the big players in the grocery sector, if we could reach a remarkable chunk of the SMEs, our data size would demand a big data engineer quite early.

  • Is there any known case of any other company exploiting the same data?

The risk in business lies at the extremes: when nobody is doing it and when everyone is doing it. If you know of anyone who is exploiting the same data you are, then you are not that crazy. Mimic your rivals' outputs (“Good artists copy; great artists steal”), but give your solution a singular accent.

At BitPhy, since we are retail experts, we knew quite well what data analysis the big brands were doing to boost their sales. The thing is that in Spain over 93% of the retail market is made up of SMEs, so most of the players' data goes unused. Besides, that data is difficult for third parties to access, and the business model has already been validated by the big brands.

  • What is the availability of the data, i.e. the technical infrastructure?

“Do machine learning like the great engineer you are, not like the great machine learning expert you are not.” (Martin Zinkevich)

It does not matter how crucial, disruptive, and important any data is for a business. If the technical infrastructure is not ready for it, you had better move on to the next issue. I have seen scenarios in which, after a deep analysis of a high-value dataset, the team found that the data source was refreshed manually. Keeping the algorithms working then depends on that manual step, so one of the axioms of data science, automation, is violated, and the overall project is useless.

Before starting to write any data science script, make sure you are aware of the technical limits and constraints of your dataset. Seriously, outside of the online world, this is probably the most important reason for a data project to fail.

There are beautiful and up-to-date uses of technology outside the online industry, but the norm is to find outdated software and hardware. This is very important to consider when you move from online business to offline, whatever it is (industry, retail): the machines in use are far from being designed for data use, and the software under the hood is usually more than a decade old.
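One cheap way to probe the manual-refresh problem described above, before investing in any algorithm, is to look at when the data source was last updated and how regular its past updates were. The sketch below is only a heuristic under assumed inputs: the function names `is_stale` and `looks_manual`, the 24-hour limit and the one-hour tolerance are all made up for illustration.

```python
from datetime import datetime, timedelta

def is_stale(last_refresh, now, max_age=timedelta(hours=24)):
    """Return True when the newest refresh is older than max_age."""
    return now - last_refresh > max_age

def looks_manual(refresh_times, tolerance=timedelta(hours=1)):
    """Heuristic: flag a feed whose refresh gaps vary by more than tolerance.

    An automated feed tends to refresh at regular intervals; widely
    varying gaps hint that somebody is triggering the refresh by hand.
    """
    ordered = sorted(refresh_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return False  # not enough history to judge
    return max(gaps) - min(gaps) > tolerance

# Illustrative refresh logs (invented timestamps).
nightly = [datetime(2024, 1, d, 3, 0) for d in (1, 2, 3, 4)]
by_hand = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 2, 17, 0),
    datetime(2024, 1, 6, 11, 0),
]
```

Here `looks_manual(nightly)` is false and `looks_manual(by_hand)` is true. A flagged feed is not proof of a manual process, only a prompt to ask the owner how the data actually gets there.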

  • How close can one work to the final data consumer?

“Keep your data close and your data consumer closer”. If you said that to Vito Corleone, he probably would not understand anything, but a UX expert will quickly understand it. Everyone knows that UX without the user is just X; the same happens in data science. Many of the algorithms you might develop already exist in the minds of the workers. It may be something intuitive, but they already have some golden rules that you have to convert into a quantitative science. That is data science in a nutshell: you have to transform the qualitative into the quantitative.

In recommendation systems, when an algorithm starts from scratch with a new user, it takes some time for it to classify the user. This is called the cold start problem. If you want to give value to any market, don’t start from scratch: learn the golden rules as quickly as possible and push them to the next level by automating them and creating new insights for business intelligence.

So again, it does not matter if you have a dataset in your hands that you believe is amazing: if you cannot work close to the people who are directly or indirectly generating that data, it will be very difficult for you to yield anything outstanding.

  • If the data is not yours, why is anyone going to share it with you?

It still surprises me how easily our first clients lent us the credentials to their POS. Besides the chain of trust, there was another latent factor: they did not give much importance to their data. And honestly, they had no reason to: who would value a set of tables and rows filled with numbers and strings when their work consists of selling fruit boxes?

Thus, the reasons for someone to share their data are diverse, although a benefit is always expected, and the relevance one gives to one's data is proportional to the benefit one expects from it. That relevance also depends on the technology: old technologies still in use perfectly accomplish the mission they were designed for and are a pain in the neck for data experts. A notable side effect, however, is that the less is expected from them in terms of data, the easier it is to convince the owner to give you the key that opens the data source.

If you passed all the previous checkpoints, it’s time for the data scientist.

Data science projects can often be demanding and very complex: they can require a deep understanding of the data analysts' heuristics for the specific problem, take hundreds of features, and be integrated into a fine-tuned neural network. For a normal-sized team, that means months of work without any yield. Businessmen tremble under those circumstances. However, a data science project is like Tetris. It is a puzzle of many small developments, each of which solves a small problem. A good data science product owner must understand all the subprocesses in a data science project and be capable of deriving value from each of them. The final goal may therefore be delayed, but along the way value is delivered continuously from the very beginning.

Note that data scientists are good at automating manual (and not necessarily silly) tasks and at generating new data. Allow free exploration and suggestion of algorithms based on the available data, taking the eye off the business goal. From time to time, letting your data scientists forget the business goal is the seed of the big product.

Finally, the business has to evaluate the resulting backlog of products and, if any of them has economic value, the cycle is complete and the data project justifies itself. Time to rock!

Interesting reads.

Industry X.0: Realizing Digital Value in Industrial Sectors. By: Eric Schaeffer

