November 4, 2020
DataOps: the secret behind successful ML initiatives
by Anastasiya Parkhomenko | 5 min read
Sophisticated models and ML-enabled applications are now delivering new capabilities in computer vision, fraud detection, voice computing, and more. These breakthroughs are driving demand for clean and representative data.
Each ML model has a complex life cycle covering development, training, simulation and validation, deployment and, finally, operation, with multiple iterative loops along the way. For example, after simulating and validating a model by comparing its results with a production model, data scientists might need to train it again with additional data. Moreover, after deployment, model performance often degrades over time as production datasets evolve. Then it is back to development, training and simulation.
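The monitoring step in this loop can be sketched in a few lines of Python. This is a minimal, hypothetical drift check, not a production implementation: the names `baseline_accuracy` and `drift_threshold` are chosen for illustration, and real systems would compare richer statistics than a single accuracy score.

```python
def needs_retraining(recent_accuracy, baseline_accuracy, drift_threshold=0.05):
    """Flag a deployed model for retraining when its accuracy on fresh
    production data falls too far below the accuracy measured at validation."""
    return (baseline_accuracy - recent_accuracy) > drift_threshold

# Example: a model validated at 92% accuracy now scores 85% on recent data,
# so it is sent back to development, training and simulation.
if needs_retraining(recent_accuracy=0.85, baseline_accuracy=0.92):
    print("Model degraded: retrain with additional data.")
```

Running a check like this on a schedule is what turns "continuous evaluation of the deployed model" from a manual chore into part of the pipeline.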
There is ample room for errors, false alerts and lost investments, and only a small number of projects make it to day-to-day business operations. Lack of accurate data affects development and training, leading to ineffective algorithms and poor models. Unreliable data pipelines disrupt the ability of ML systems to deliver business value on time. A small fraction of effort is typically dedicated to innovation due to time-consuming and repetitive data engineering and pipeline construction tasks. The volumes, complexity and accessibility of data make continuous evaluation of the deployed model, tuning and re-deployment a tough challenge. These issues make DataOps crucial for ML systems where full, clean datasets are constantly required to develop, test, tune and deploy effective algorithms.
Before embarking on enterprise-scale ML projects, organisations must lay the foundations of a data-driven culture. To start with, this means enabling self-serve data exploration and standardising the common data landscape across the organisation.
Probably the best way to achieve this is to adopt DataOps. DataOps is an agile way of developing, deploying and operating data-intensive initiatives such as ML projects. DataOps promotes a data-democracy mindset, encouraging exploration and managing data pipelines in an automated way for a wide range of users: data owners, data managers, data scientists, analysts, developers and business users. At the same time, DataOps ensures that access to data is controlled, protecting privacy, usage restrictions, user permissions and data integrity.
Check out how our DataOps platform empowers big data teams with data governance and data lifecycle management.
The typical components of end-to-end Machine Learning and Data Science projects are summarised in the chart below:
The following DataOps techniques are pivotal for enterprises to optimise and secure the process of end-to-end Machine Learning and Data Science:
1. Cross-functional goal-oriented teams. With the ML field rapidly evolving, each team member must align individual expertise with the project’s goals. Efficient teams involve data owners, data managers (e.g. data architect or data engineer) and data consumers (e.g. data scientist or analyst).
2. Data discovery and exploration. DataOps breaks down data silos by providing centralised access to metadata from all data sources, keeping track of provenance and ownership. Automated data discovery and mapping can save plenty of time that can instead be invested in innovation.
3. Automated data delivery processes. Easy, seamless and automatic pipeline deployment is vitally important to prevent bottlenecks in time-sensitive ML projects. DataOps promotes codeless design and automated deployment, monitoring and scheduling of reusable pipelines as best practice.
4. Automated data governance policies. Rules and governance should be applied directly to data fed to ML. It is important to embed data anonymisation and retention policies in every pipeline, and log, track and trace all data requests.
5. Treating data like code. Continuous transformations of existing datasets are inevitable in ML projects but all changes should be versioned, stored, and treated as code to mitigate risks and support iterative development processes.
6. Human-centred framework. Creating channels for continuous communication, feedback and project-level notifications is the cornerstone of agile approaches, including DataOps. Furthermore, it is important to make sure everyone's roles and responsibilities are clear and transparent.
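Point 4 above, embedding anonymisation and request logging directly in every pipeline, can be illustrated with a short Python sketch. The field names and the hash-based masking are assumptions made for the example; a real governance layer would follow the organisation's own policies and tooling.

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.governance")

def anonymise(record, sensitive_fields=("email", "name")):
    """Replace sensitive fields with a one-way hash before the data is fed
    to ML, and log the request so it can be tracked and traced."""
    out = dict(record)
    for field in sensitive_fields:
        if field in out:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()[:12]
    log.info("anonymised fields %s for record id=%s", sensitive_fields, out.get("id"))
    return out

row = {"id": 42, "email": "user@example.com", "amount": 19.99}
clean = anonymise(row)
# 'email' is now an opaque hash; non-sensitive fields are untouched.
```

Because the rule lives inside the pipeline step rather than in a separate process, every dataset that reaches model training has already passed through it.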
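Point 5 above, treating data like code, implies that every transformation of a dataset yields an identifiable, referenceable version. One minimal way to sketch this, assuming row-oriented JSON-serialisable data, is to derive a version id from the dataset's content, much like a commit hash:

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a deterministic version id from dataset content, so each
    transformed dataset can be stored and referenced like a code commit."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:16]

v1 = dataset_version([{"id": 1, "label": "cat"}])
v2 = dataset_version([{"id": 1, "label": "dog"}])  # changed data, new version
assert v1 != v2
```

Identical data always maps to the same id, so pipelines can detect whether an input has changed and support the iterative development loops described earlier without risking silent overwrites.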
Data science and machine learning are here to stay, and will continue to have a profound impact on organisations across all industries. The faster organisations can train and deploy effective models into production, the more competitive advantage they will earn. The best practices and proven techniques of DataOps empower businesses to accelerate time-to-value and drive more frequent releases at higher quality, for drastically better business outcomes.