The Future of Data Science
…in which I succumb to the year-end claim-making which requires no accountable follow-up :-)
As we begrudgingly return to the daily grind, I’m throwing my hat in the ring for another “2021 predictions” article!
I’m mainly writing because most of the 2021 data science prediction articles I’ve read so far discuss all the great things about the next cool algorithm or new structures — which are all fascinating (well, perhaps not ALL). The general non-tech populace, however, is becoming more familiar with what exactly data is.
We as data scientists, however, often miss the [random] forest for the [gradient-boosted] tree (couldn’t help myself with the puns…). We often head straight for the data and start modeling, rather than asking the contextual questions that surround the data, like “where did it come from?” (you’ll notice in the CRISP-DM process, there’s no start point of “Data Collection”…).
Therefore, I foresee that the greatest trends will not necessarily be purely academic algorithmic advances or high-minded hypothetical hunches about potential data science, machine learning, and artificial intelligence, but rather increasing retroactive views of the field itself and a more thorough questioning of the already existing implementations. Thus, my predictions for the field aren’t which tools and languages will become the coolest / most relevant, but broader-scoped cultural trends surrounding data science. Anyway…
Data Science Trends for 2021:
1. Data Literacy
The amount of data we collect and store outpaces by far the resources we have to produce useful analysis of that data. At the same time, however, the results of predictions and analyses that we do have the capability to make have profound impacts on every aspect of our lives. Thus, the need for data literacy and education will increase at an even faster rate than it already has.
I recently did a project assessing the impact of the “digital divide” in education (also known as the homework gap), where the divide was between those who did and didn’t have broadband / device access. What we’ll see in data science is a “data divide” — between those who have the resources to collect, store, and analyze a ton of data, and those who don’t. Imagine, for example, the small restaurant barely making ends meet that can’t afford the POS analytics package, compared to the regional / national chain that can afford to buy customized analytics-as-a-service as we navigate the uncertainty of the pandemic.
On an individual basis, this will mean rapidly increasing numbers of citizen analysts (like the 14-year-old who created this covid-19 dashboard). These will be the thousands of people who are currently passing through self-paced online platforms (e.g., Datacamp, Dataquest, freecodecamp, Data Checkpoint), bootcamps (e.g., Flatiron — the one I attended, General Assembly, Springboard, etc.), and university certificate programs. Even at the most introductory of levels, the widespread availability of and access to materials for learning data science will produce a generation of people who do data science for fun, on the side, or at work. Many of these citizen analysts will certainly take the steps to complete these programs and become “data scientists” in order to avoid the analytic inequality gap.
Overall, as individuals in society, we’re going to have to learn how to separate the “signals from the noise” as data science becomes a more regular part of our daily conversations.
2. Data Transparency
The language of Data Science and Machine Learning has become so specialized that, thus far, it has been left to those with Master’s degrees and PhDs in CompSci or similar fields to create and analyze algorithms for industries across the board. As data literacy and education increase, however, people with broader perspectives and interdisciplinary academic backgrounds will continue to bring an even sharper critical eye to how we collect, organize, analyze, and communicate data. These new perspectives will bring long overdue questions and critiques of the algorithms that have been generated in isolation by data scientists and engineers — requiring the opening of the so-called “black box” algorithms, to the extent possible.
We’ll start to see more mainstream considerations of things like algorithmic accountability auditing to open these so-called black boxes of machine learning to explain how they achieved the results presented. The “that’s what the algorithm said” excuse will begin to disappear as citizen analysts and others who become more familiar with the language of machine learning and data science start demanding answers from the creators of these algorithms.
Light will be shone on these algorithmic black boxes through further research into creating and communicating interpretable machine learning models, as well as through more intentional feature engineering in deriving predictions. Critical analysis of feature importances will become foundational in analyzing the transparency of a model — which businesses will value.
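To make the feature-importance idea concrete, here’s a minimal sketch of the kind of critical analysis described above, using permutation importance — a model-agnostic way to peek inside a black box. It assumes scikit-learn, and the dataset is synthetic, purely for illustration:

```python
# Sketch: inspecting feature importances to make a model more transparent.
# Assumes scikit-learn; the dataset is synthetic and purely illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of which actually drive the target.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance measures how much the test score drops when one
# feature's values are shuffled -- a window into the "black box".
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

An analysis like this lets a data scientist show stakeholders *which* inputs drove a prediction, rather than pointing at the model and shrugging.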
3. Data Ethics
The transition to data transparency will be guided by and grounded in the widespread implementation of data ethics, especially when GDPR and similar legislation begin to take hold and become validated (or rejected) in the courts. As we move increasingly towards recommendation systems over simple predictions, the business value of transparency through explainable models will increase as business leaders incorporate more ethics into their data-driven decision making. This means that machine learning and AI systems will have to not only provide more clear explanations for their results, but also include next best options and basic information about the costs and tradeoffs. This will apply to everything from weighting the balances of environmental justice with profitability, to criminal justice and court efficiency.
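One way such a system might surface next-best options and tradeoffs alongside its top pick — a toy sketch in plain Python, where every option name, profit figure, and cost is hypothetical:

```python
# Toy sketch: a recommender that reports next-best options and their
# tradeoffs, not just the single highest-scoring choice.
# All names and numbers here are hypothetical, for illustration only.

options = [
    {"name": "route_A", "profit": 100, "env_cost": 40},
    {"name": "route_B", "profit": 90,  "env_cost": 10},
    {"name": "route_C", "profit": 60,  "env_cost": 5},
]

def recommend(options, env_weight=1.0, top_k=2):
    """Rank options by profit minus a weighted environmental cost,
    returning the top choice plus next-best alternatives with tradeoffs."""
    scored = sorted(
        options,
        key=lambda o: o["profit"] - env_weight * o["env_cost"],
        reverse=True,
    )
    return [
        {
            "name": o["name"],
            "score": o["profit"] - env_weight * o["env_cost"],
            "tradeoff": f"profit {o['profit']}, environmental cost {o['env_cost']}",
        }
        for o in scored[:top_k]
    ]

for rec in recommend(options):
    print(rec["name"], rec["score"], "-", rec["tradeoff"])
```

Note that with the environmental cost weighted in, the raw-profit leader is no longer the top recommendation — which is exactly the kind of weighing of profitability against other values described above.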
More generally, we can expect to see a newly emboldened revival of the privacy vs. security debate as data laypeople begin to question things like whether or not their neighbor’s Ring camera system has the right to recognize their face every time they pass. With the ever-increasing collection of data from things like IoT devices, the question of privacy and to what extent any company owns and sells our data will bring itself to the fore again in 2021.
Perhaps — and this is dreaming, to be clear — the citizens of the United States will pass a Data Bill of Rights, similar to the GDPR, establishing a US privacy standard for the ever-increasing data harvesting and behavioral data mining committed today.
Finally, as #BLM continues to impact our culture, the conversation surrounding algorithmic bias and its role in institutional racism will certainly extend into the data science realm. We as data scientists will increasingly recognize that numbers are not neutral and data is not race-blind. Businesses will have to understand both the intended and unintended consequences of each decision-making point in the data science process, addressing the compound biased and racial outputs in their data (collection, storage, analysis and prediction). In order to address this most effectively, tech companies in the data science market will be held to account for their diversity hiring, and will make intentional efforts to increase BIPOC representation and incorporate diverse opinion onto their data science teams.
4. Data Accessibility
How enterprises and consumers alike collect, store, access, and analyze their data will become an [even more substantial] industry of its own.
Continuing the trend towards data ethics (and as the big tech companies attempt to head off the coming wave of privacy lawsuits), we’ll see the further expansion of what I called, as an RA in college, “challenge by choice” — essentially, opt-out by default will become the new standard for consumer data collection, replacing the current opt-in default.
In terms of storage, access, and analysis, two opposing trends will take further root. The first: cloud platforms like AWS and Google Cloud will gain strength amongst enterprises and small businesses that don’t have the resources to develop their own data architecture, but still want to reap the benefits of access to big data. These platforms will continue to grow beyond just providing cloud data architecture and computing power into providing their own form of analytics-as-a-service.
At the same time, however, the second trend will come from enterprises that have the technical resources and financial capability to build their own data architecture, rather than depend completely on the juggernauts in the cloud architecture arena for their data science needs. We’ll see this through the disaggregated stack (where SQL will ground itself as the common language) and the increased popularity of data lakes. It’ll be like open source for big data / cloud architecture, as companies move towards a multi-cloud, multi-platform approach.
For [a consumer tech] example, the mobile phone market provides a great analogy (I’ll declare my bias here that I prefer the Android side of things). iPhone users, presumably, just want one device to handle their day-to-day on-the-go technology usage, where everything works together. Knowing this, Apple has been augmenting their device infrastructure capabilities with service offerings within that walled garden. On the other side of things, Android users (like me) presumably want the potential benefits of open source — having options provided by Samsung, Nokia, Google, etc. Perhaps Raspberry Pi provides an even more apt example: modularizing devices so that consumers can pick and choose the parts that matter most to them.
At the apex of data accessibility, we’ll find an increasing movement towards containerized production through platforms like Docker and Kubernetes. Production will be containerized to an even greater extent and scale in a way that enables enterprises to move between various architectures and cloud setups.
5. Data Storytelling
One of the last steps of the data science process is communicating findings to the stakeholders involved and recommending action. Often these steps are automated; however, as data transparency increases and humans are required to make more ethical decisions, data storytelling will become one of the main forms of communicating data findings. It will mean empowering non-technical audiences to make the business case through narrative storytelling, and not just numbers.
Specifically, we’ll find plenty of articles to read, podcasts to listen to, and webinars to watch about ditching the dashboard. While the dashboard won’t disappear, an increase in data literacy will enable data scientists and amateurs alike to create dashboards very easily. But at the end of the day, the real value-added data science comes from producing action and not just presenting numbers or charts. Those with the ability to create a user story by explaining the Whys and Hows, as opposed to just the Whats and How Manys, will rise above the pack and provide unique value to stakeholders and decisionmakers. While a dashboard can show predictions, effective data storytelling will provide actionable insights beyond “simple” predictions.
As recommendation engines grow in value by turning these user stories into actionable results, ethics and transparency — the more human side of data science — will shape their output. No longer will the black box model predicting the highest profit be the best algorithm. Instead, the most effective recommendation engines will provide a human user with options and next-best recommendations, along with narrative insight into the tradeoffs, risks, and costs associated with each.
For [a very technical] example: suppose a simpler, more explainable model requiring fewer datapoints produces an adjusted R² within, say, one thousandth of that of a much larger black-box model requiring more datapoints from the consumer (ceteris paribus, of course). The company could then decide that the marginal gain in predictive power is not worth the ethical cost of gathering extra consumer data for a model whose workings it cannot explain.
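A quick sketch of what that comparison looks like in practice. The R² values, sample size, and feature counts below are invented to mirror the scenario; the adjusted R² formula itself is standard (1 − (1 − R²)(n − 1)/(n − p − 1)):

```python
# Sketch: comparing a simple and a complex model by adjusted R-squared.
# The R-squared values, sample size, and feature counts are hypothetical.

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n samples and p predictors:
    penalizes each extra feature a model uses."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 1000                                 # hypothetical sample size
simple = adjusted_r2(0.850, n, p=3)      # explainable model, 3 features
complex_ = adjusted_r2(0.853, n, p=40)   # black-box model, 40 features

print(f"simple:  {simple:.4f}")    # ~0.8495
print(f"complex: {complex_:.4f}")  # ~0.8469

if complex_ - simple < 0.001:
    print("prefer the simpler, explainable model")
```

In this made-up case the penalty for the 40-feature model actually erases its raw-R² edge entirely — the kind of result that makes the ethical choice an easy business choice too.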
Thanks for reading this far! I’ve added two bonuses which I foresee but have not done enough research on yet…
- Graph technology — I honestly have more to learn here, but the idea is that with the magnitude and scale of data collection, storage, and analysis increasing, graph technology analyzes not the data itself but the relationships between data points and structures, facilitating decision-making. Metadata will become more valuable than the data itself.
- Full-stack data scientist — This already exists, but the term data scientist has been diluted. Companies will begin to search for employees who can access, merge, analyze, visualize, communicate, AND produce recommendations from datasets for data-based decisions. From querying the databases to deploying a dashboard or app, data scientists with a complete stack will become even more valuable than those skilled in specialized areas, given the rapid rate of continuous change in the industry.
So that’s it for now! Data literacy, transparency, ethics, accessibility, and storytelling are the trends for the future, with graph technology and the full-stack data scientist supporting them.
I haven’t really been a huge fan of New Year’s “resolutions” or “predictions”… mostly because, besides getting lost in the cacophony of resolutions, we rarely take the opportunity to actually hold ourselves and others accountable for the predictions made and follow through on their evaluation. So hold me accountable! What’d I miss? Who am I ignoring? Am I totally wrong? Find me on LinkedIn or e-mail me: firstname.lastname@example.org — I look forward to hearing from you!