Creating and developing artificial intelligence essentially requires three ingredients: data, an algorithm and computing power. Today's algorithms have an ever-increasing number of parameters (175 billion for GPT-3.5, the model behind ChatGPT) and they need to be fed a very large volume of data in order to be trained correctly and produce relevant responses.
For several decades now, humanity as a whole has been producing an exponentially growing volume of data. Most of it comes from the internet and connected objects, or is produced by companies.
To fully appreciate the quantity of data generated by our activities, it is useful to look more closely at these data sources. The world of connected objects (known as the Internet of Things or IoT) emerged in the 1980s. The IoT developed steadily and rapidly over the next forty years. Here, we are referring to sensors or measurement tools that are used and present in industrial processes, "smart buildings", mobility and transport.
In reality, connected objects can be found everywhere! The temperature and humidity sensor in my home is a connected object generating data that could be used to feed into an AI.
Initially, IoT devices were limited to collecting and transmitting data. Today, with advances in machine learning and very lightweight integration frameworks, these devices can be configured to initiate actions of their own or respond to events in their environment.
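By way of illustration only, here is a minimal sketch of that kind of local decision-making; the sensor readings are simulated and the 3-sigma threshold is an arbitrary assumption, not a real device API:

```python
import random
import statistics

# Hypothetical edge-device loop: the "sensor" is simulated and the
# anomaly rule is an illustrative choice, not a recommendation.

def read_temperature() -> float:
    """Stand-in for a real sensor driver; returns degrees Celsius."""
    return random.gauss(21.0, 0.5)

def main() -> None:
    history: list[float] = [read_temperature() for _ in range(30)]

    for _ in range(100):                      # on a real device: an endless loop
        reading = read_temperature()
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)

        # Local decision-making: flag readings far from the recent baseline
        # instead of merely forwarding raw data to the cloud.
        if abs(reading - mean) > 3 * stdev:
            print(f"Anomaly: {reading:.1f} °C (baseline {mean:.1f} °C)")
            # here the device could switch on ventilation, raise an alert, etc.

        history = history[1:] + [reading]     # rolling window of recent readings

if __name__ == "__main__":
    main()
```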
According to US market research firm IoT Analytics, 14.4 billion connected objects were registered in 2022. This number is expected to rise to 27 billion by 2025.
But data is not just about IoT! Another prominent member of the family of data producers is the internet. It is an acknowledged fact that the internet is a huge collector and generator of data. In October 2022, the number of internet users on the planet reached 5.07 billion (up 3.5% on the previous year).
To get a clearer idea of what this represents, here are a few revealing examples of the volume of data produced on the internet in the space of 1 minute (September 2022):
- 9.6 million searches on Google
- 231.4 million emails
- 104,600 days of Zoom video calls
- 1 million hours of video streaming
- 443,000 dollars in sales on Amazon
- 1.1 million Tinder user swipes
- 66,000 photos on Instagram
Quite mind-blowing figures!
And finally, there is the huge volume of data produced by companies, data which is often exploited "little, not at all or badly", even though the volume of data produced by industry continues to grow. It is currently estimated at 2.02 petabytes (one petabyte is equivalent to 1,000,000 gigabytes).
To comprehend the issues emerging around data, one key indicator is that 80% of the data produced worldwide in 2025 will be unstructured. In other words, it will not be organised into predefined formats such as tables or databases; to put it even more simply, it will be "on the loose". Because deep learning and neural networks can learn directly from this kind of raw material (text, images, audio, sensor signals), artificial intelligence training techniques are moving increasingly in that direction.
In conclusion, we live in a digital world that produces an ever-growing quantity of (unstructured) data.
And that's just as well, because data is the fuel that artificial intelligence runs on. As the number of parameters in deep learning algorithms continues to grow (already in the hundreds of billions, and heading towards the trillions), ever more data is also needed to improve the machine's performance.
It's often said that, in the field of artificial intelligence, 80% of the work lies in the data. This much is true.
Data can throw up several common problems for AI:
- When the data used to train models is incomplete or inaccurate, this affects the performance and reliability of the models.
- Biases can be introduced in a number of ways, including during collection, sample selection or the labelling (or qualification) process. These biases can be reflected in the predictions of AI models and lead to discriminatory or unfair results.
- A lack of diversity in the data produces models that are no longer relevant for populations or situations different from those on which they were trained, and therefore less accurate and less reliable in real contexts (a simple check is sketched below).
- The evolution of data over time also deserves attention: models need to be regularly updated with new data to remain relevant and accurate in constantly changing environments.
- The icing on the cake is that collecting, storing and managing these massive volumes of data often presents challenges in terms of cost, infrastructure, confidentiality and regulatory compliance.
Not an easy task!
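To make the completeness and diversity points above concrete, here is a purely illustrative sketch; the records, the "region" attribute and the 25% threshold are invented for the example and do not come from any real project:

```python
from collections import Counter

# Hypothetical training records; None stands for a missing value.
records = [
    {"label": "defect", "region": "north"},
    {"label": "ok",     "region": "north"},
    {"label": "ok",     "region": "north"},
    {"label": "ok",     "region": "north"},
    {"label": "ok",     "region": None},      # incomplete record
    {"label": "ok",     "region": "south"},
]

# Completeness: how many records are missing the attribute?
missing = sum(1 for r in records if r["region"] is None)
print(f"Missing 'region' values: {missing}/{len(records)}")

# Representativeness: is any group severely under-represented?
counts = Counter(r["region"] for r in records if r["region"] is not None)
total = sum(counts.values())
for group, n in counts.items():
    share = n / total
    if share < 0.25:                    # illustrative threshold, not a standard
        print(f"Group '{group}' makes up only {share:.0%} of the usable data")
```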
To improve the quality of data used in its operations, Egis has developed know-how and expertise in data characterisation and qualification. This expertise is based on an in-depth understanding of the various aspects of data: its structure, quality, reliability and relevance. In more detail, it is often necessary to characterise data, i.e. to identify and describe the fundamental characteristics of a data set (its nature, structure and properties) so that it can be analysed and used appropriately. Egis has this expertise.
This expert work helps produce better-characterised data that will enhance the performance of our algorithms.
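Purely as an illustration of what such characterisation can start from (the columns and the small sample below are invented, not Egis data), basic profiling of a tabular data set might look like this:

```python
import pandas as pd

# Invented sample of sensor readings, purely for illustration.
df = pd.DataFrame({
    "sensor_id":   ["A1", "A1", "B2", "B2", "C3"],
    "temperature": [21.3, 21.5, 19.8, None, 35.2],
    "status":      ["ok", "ok", "ok", "error", "ok"],
})

# Structure: column types and the number of distinct values per column.
print(df.dtypes)
print(df.nunique())

# Properties: basic statistics and the proportion of missing values.
print(df.describe(include="all"))
print(df.isna().mean())
```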
In the same way, Egis works on data qualification. This is another essential stage in data management, aimed at ensuring that the data used is reliable, accurate and suited to the specific objective or context. It is often laborious work, but it is essential if you want to develop high-performance AI.
The work consists of checking the quality of the data in terms of accuracy, consistency, integrity and completeness: ensuring that it is free of errors, duplicates and missing values, and that it complies with the defined quality standards. It also means checking that the data is relevant, i.e. that it contains the information needed to train or run the algorithm. And of course the data must be valid, meaning that it faithfully represents the phenomena or objects it is supposed to represent.
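As a minimal sketch of such checks, assuming a small tabular data set like the one above (the columns, values and the -40 °C to 60 °C validity range are illustrative assumptions, not actual Egis quality rules):

```python
import pandas as pd

# Illustrative data set containing deliberate defects.
df = pd.DataFrame({
    "sensor_id":   ["A1", "A1", "B2", "C3", "C3"],
    "temperature": [21.3, 21.3, None, 250.0, 19.9],  # a gap and an implausible value
})

report = {
    # Completeness: missing values per column.
    "missing": df.isna().sum().to_dict(),
    # Integrity: exact duplicate rows.
    "duplicates": int(df.duplicated().sum()),
    # Validity: readings outside an assumed plausible range of -40 °C to 60 °C.
    "out_of_range": int((~df["temperature"].dropna().between(-40, 60)).sum()),
}
print(report)
```

Each failed check then becomes a decision point: correct the record, discard it, or go back to the source.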
In any artificial intelligence project, data is essential. It must be processed to obtain effective results. This is why Egis has developed this area of expertise. It is a solid foundation for developing and deploying artificial intelligence.