Machine learning, artificial neural networks and social research

Abstract
Machine learning (ML), and particularly algorithms based on artificial neural networks (ANNs), constitute a field of research lying at the intersection of different disciplines such as mathematics, statistics, computer science and neuroscience. This approach is characterized by the use of algorithms to extract knowledge from large and heterogeneous data sets. In addition to offering a brief introduction to ML based on ANN algorithms, in this paper we focus on its possible applications in the social sciences and, in particular, on its potential in data analysis procedures. In this regard, we provide three examples of applications to sociological data to assess the impact of ML on the study of relationships between variables. Finally, we compare the potential of ML with that of traditional data analysis models.
Introduction
ML is an automatic learning process that takes place through the processing of usually very large data sets. Earlier procedures, grouped under the label “symbolic artificial intelligence”, relied on algorithms consisting of a logical set of instructions that encoded a given output (usually called the target) for every possible input. By contrast, the new ML systems “learn” directly from data: they estimate mathematical functions that discover representations of some input, or that link one or more inputs to one or more outputs, so that predictions can be formulated on new data (Jordan and Mitchell 2015).
In recent years ML has started to be applied in various human sciences, including economics (Varian 2014; Blumenstock et al. 2015; Athey and Imbens 2017; Mullainathan and Spiess 2017), political science (Baldassarri and Goldberg 2014; Bonikowski and DiMaggio 2016), sociology (Barocas and Selbst 2016; Evans and Aceves 2016; Baldassarri and Abascal 2017) and communication science (Hopkins and King 2010; Grimmer and Stewart 2013; Bail 2014). It is used both in academic research and in areas related to the management of services provided by the public administration (Athey 2017; Berk et al. 2018) or by private companies.
Overall, many different approaches and tools are included under the ML label (Kleinberg et al. 2015). Here we will only consider ANNs that use supervised ML algorithms. In supervised ML the algorithm observes an output for each input. This output gives the algorithm a target to predict and acts as a “teacher”. By contrast, unsupervised ML algorithms only observe the input, and their task is to independently compute a function without a predetermined target (Hastie et al. 2009; Molina and Garip 2019). The goal of this paper is to apply ANNs to sociological data and to compare the results with those of traditional statistical techniques, in order to evaluate the strengths and weaknesses of both.
Short illustration of artificial intelligence and machine learning based on artificial neural networks
Artificial intelligence (AI) is a branch of computer science that encompasses a huge variety of computational operations, ranging from classical algorithmic production to ML and deep learning (DL) techniques (Russell and Norvig 2010; Kitchin 2014b). The substantial difference between these approaches is that while traditional AI problem solving methods are based on if–then rules, ML and DL seek to iteratively evolve an understanding of data sets without the need to explicitly code any rules. This allows the computing system on which they are implemented to automatically learn and make predictions starting from a set of input data, adjusting its parameters by optimizing a performance criterion defined on the data and reducing the error rate at each stage of the learning process (Alpaydin 2016; Goodfellow et al. 2016).
In other words, in ML the aim is to construct a software program that adapts and learns independently, that is, without a pre-programmed system that establishes how it should behave. Algorithms can learn from their mistakes thanks to training data used as examples. Accordingly, how much a model learns depends on the quality and amount of example data to which it has been exposed (Nilsson 2010; Dong 2017).
The considerable availability of information, due to the deluge of big data gathered from all kinds of specialized sensors and digital devices, and the rapid growth in parallel and distributed computing systems, made possible by the advance of faster CPUs, the advent of general purpose GPUs, the use of faster network connectivity and better software infrastructure for distributed computing, have given a boost to this sector (National Research Council 2013 ; Schmidhuber 2015 ; Goodfellow et al. 2016 ). AI applications are constantly evolving, reaching high levels of complexity and fascinating results in many different tasks: language translation, speech recognition, visual processing, spam filtering, and so on.
It is intuitive that companies capable of collecting and storing data correctly are candidates to be at the top of the AI sector. Many of the applications of DL are highly profitable (Goodfellow et al. 2016; Zuboff 2019; see Note 1). Indeed, despite the emphasis around the state of the art, most big tech companies still use traditional ML models instead of more advanced DL, and depend on a traditional infrastructure of tools poorly suited to ML (Dong 2017).
Early findings of DL date back at least to the 1960s, when it was intended to be a computational model of biological learning, that is, a model of how learning happens or could happen in the brain. As a result, one of the names that DL has gone by is ANNs (Schmidhuber 2015; Goodfellow et al. 2016). The two terms are often used as synonyms. To be precise, DL is a subfield of ANNs that uses multi-layered neural networks to process information. The idea behind deep neural networks is that, starting from the raw input, each hidden layer (so named because its values are not given in the data) combines the values in its preceding layer and learns more complicated functions of the input. It is difficult for a computer to understand the meaning of raw input data. DL resolves this difficulty by breaking the desired task into a series of nested concepts, each described by a different layer of the model (LeCun et al. 2015; Alpaydin 2016; Goodfellow et al. 2016).
There is no consensus about how much depth a model requires to qualify as deep. Discussions with DL experts have not yet yielded a conclusive response to this question. However, DL can be safely understood as the set of models that involve a greater amount of composition of either learned functions or learned concepts than traditional ML does (Schmidhuber 2015 ; Goodfellow et al. 2016 ).
DL is not a breakthrough in the scientific sense; rather, it is a relevant breakthrough in efficient coding that makes a difference in several contexts. In practical applications, DL is able to achieve higher accuracy on more complex tasks than traditional ANNs, although it requires more computational resources. Furthermore, DL needs less manual intervention to craft the right features or the suitable transformations of data. It performs exceptionally precise operations on data that come from different modalities, such as images, texts and videos (Schmidhuber 2015; Alpaydin 2016; Goodfellow et al. 2016).
In summary, ML offers numerous mathematical tools to deal with a wide variety of problems. The main tool, very popular nowadays, is the ANN, which is trained to solve a particular task. Neurons are organized into groups called layers and connected to each other to form a network. As mentioned, when the number of layers is high, the neural network is defined as deep. The DL approach attempts to mathematically model the way in which the human brain processes information in vision and hearing: the stimuli from eyes and ears, passing through the human brain, are initially broken down into simple concepts and gradually reconstructed in increasingly complex and abstract representations (Russell and Norvig 2010; Alpaydin 2016; Goodfellow et al. 2016).
Similarly, in a deep network a face is broken down in the form of an array of pixel values. The first layer can easily identify edges of different orientations. Subsequent layers combine these to form corners and extended contours. Layers that follow can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, these in turn are combined with some more layers of processing, allowing us to represent the faces we want to learn (Nilsson 2010 ; LeCun et al. 2015 ; Alpaydin 2016 ; Goodfellow et al. 2016 ).
So, the choice between ML or DL algorithms depends on the problem to be analyzed. If the problem is relatively simple, it is preferable to use ML based on ANNs with few layers of hidden units; if the problem is complex or requires the achievement of very specific and rigorous objectives, it is considered more useful to resort to DL.
Methodology
The starting point of our experiments is to evaluate whether, in the typical data analysis operations of social sciences, the techniques of ML based on ANNs can constitute an alternative, or at least a possible integration, with respect to the traditional data analysis tools which basically consist of linear and logistic regression models.
As is known, in general, multivariate data analysis models perform an empirical control of one or more hypotheses derived from a theory and the results consist in the comparison between the so-called expected, or theoretical, data and the empirical data. If the outcome of this comparison is attributable to random effects, it is said that the model fits, or is compatible, with the data; otherwise the model must be revised or, if this is impossible, rejected (Di Franco 2017 ). It is therefore the so-called confirmatory-explanatory approach.
Starting from the work of data analysis pioneers such as Fisher ( 1925 , 1935 ), Galton ( 1869 , 1886 ), Spearman ( 1904 , 1927 ) and many others, for many decades data analysis in the social sciences has been characterized by this approach which fundamentally seeks to identify, from associations between a set of empirically detected variables, causal links between the same variables. In this context the model (i.e. the theory) is prevalent and the data are used to evaluate the goodness of fit of the model, expressed by the values of a coefficient of statistical significance (p-value).
Alternative approaches to data analysis, based on induction, exploration-description, simulation, etc., which have also been proposed over time (among others by Benzécri 1969, 1992; Benzécri et al. 1973a, b; Tukey 1977; Gifi 1981, 1990), have received less interest among social science researchers. The characteristic of these alternative approaches is the inversion of the relationship between data and theory: data are more important than the model. This means that, starting from the data, it is necessary to find the model that best represents them, while in the causal approach the starting point is a model and the data are used to test it.
Thanks to the recent developments in different disciplines such as applied mathematics, statistics, information technology, approaches based on data prevalence have become established, or are emerging, in many disciplinary areas of natural and biometric sciences. Over time these approaches have taken on different names such as data mining, statistical learning, machine learning (ML), deep learning (DL) and others.
In addition to the innovations to which we referred, starting from the development in information and communication technologies and web platforms, the current historical period is strongly characterized by the so-called big data and their management through mathematical algorithms that are able to independently process them to extract information useful for various purposes. As a result, many ML techniques exist today. A common feature of these techniques is that they are exploratory and rely on computer assisted analysis.
One large subdivision of these techniques uses a single outcome and tries to make an optimal prediction of this outcome from multiple predictor variables (supervised learning techniques). The second subdivision does not require any outcome and merely classifies inputs into subgroups based on similarities among a set of variables (unsupervised learning techniques).
For our experiments we will use ANNs whose units are arranged in three layers (input, hidden and output), with unidirectional connections from each unit of one layer to all the units of the next layer.
Being essentially a distributed processor built in analogy with the human central nervous system, an ANN is generally composed of elementary computational units called neurons, which can be thought of as interconnected nodes of a network, each with certain processing capacities (see Note 2). Artificial neurons are able to receive a combination of signals from the outside or from other neurons, and then transform them through a particular function called the activation function, thus storing data in the network parameters and in particular in the weights associated with every connection.
Then there is the return of an output: a result generally dependent on the purpose for which the ANN was built (classification, recognition, approximation, etc.).
The relationship between incoming and outgoing data is generally determined by:
The type of elementary units used: complexity of the internal structure, class of activation function used;
The formal structure of the network: number, orientation and direction of the nodes, which can be represented using the tools of graph theory;
The values of the internal parameters associated with the neurons and the related interconnections, which are determined using appropriate learning algorithms.
The question we ask ourselves is whether ANNs can be usefully applied in social research not only as a set of nonlinear data processing algorithms, but also as a tool to simulate social phenomena (Capecchi 1996).
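The description of artificial neurons above can be made concrete with a minimal sketch of a forward pass through a three-layer network (input, hidden, output). The layer sizes, the logistic activation function, the random weights and the function names are all illustrative assumptions; the source does not specify them.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through an input -> hidden -> output network."""
    # Each hidden unit takes a weighted sum of all inputs, then applies the activation.
    hidden = sigmoid(x @ w_hidden + b_hidden)
    # Each output unit does the same with the hidden activations.
    return sigmoid(hidden @ w_out + b_out)

# Hypothetical sizes: 4 input units, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(4, 3))   # weights on input-to-hidden connections
b_hidden = np.zeros(3)
w_out = rng.normal(size=(3, 1))      # weights on hidden-to-output connections
b_out = np.zeros(1)

x = np.array([[0.2, -1.0, 0.5, 1.5]])  # one illustrative input pattern
y = forward(x, w_hidden, b_hidden, w_out, b_out)
print(y.shape)  # (1, 1)
```

The three determinants listed above map onto the sketch as follows: the unit type is the choice of `sigmoid`, the formal structure is the pair of weight-matrix shapes, and the internal parameters are the values held in the weight and bias arrays, which a learning algorithm would adjust.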
It is difficult to assimilate social phenomena to neurophysiological ones; for this reason, the analogies of the nodes of an ANN with neurons, of its connections with synapses, etc., that are possible in the study of the brain, are not possible in these other cases. However, it is a question of assessing whether the abstractness of the structures and processes postulated in ANNs, understood as models of complex nonlinear dynamic systems, does allow their application also to the study of social phenomena. In this case it is necessary to determine the interpretation to be given to concepts such as node, connection, excitation/inhibition, connection weight, learning rule, equilibrium and so on.
On the other hand, the use of ANNs makes it possible to partially overcome some limitations of the analyses conducted with traditional statistical techniques. For example, the use of ANNs does not require any hypothesis on the distributions of the system variables and their reciprocal associations. For this reason, the treatment of cardinal, ordinal and/or categorical variables is possible (Di Franco 2017). With such an approach the actual analysis of the system is left to the network, which alone creates its own criteria to reproduce the system's behaviour and consequently becomes able to formulate predictions on the system itself. In Fabbri and Orsini’s (1993) judgement, this is both a strength and a weakness of ANNs. It is a strength because the researcher is not conditioned by a priori hypotheses in the choice of the units of the network. It is a weakness because the network can do nothing more than reproduce in a phenomenological manner the behaviour of the analysed system, without contributing to the knowledge of the internal relationships between the single parts of the system. This problem, however, can be partially overcome, as some devices that allow us to interrogate the network about what it was able to reproduce have been fine-tuned (Di Franco 1998).
If the simulation approach of ANNs to social phenomena proved to be possible and useful (Capecchi et al. 2010 ), this would allow significant progress in the social disciplines because it would also contribute to the foundation of a consistent basis of simulation concepts, models and techniques. If social phenomena can be thought of as complex dynamic systems then it is necessary to accept the possibility of simulating them on a computer with more meaningful results than those obtainable with traditional data analysis tools.
We now describe the methodology used in the examples whose results we present in the next paragraph. The data used in the three examples are taken from a matrix containing some information on the electoral polls published in Italy by the mass media from 1 January 2017 to 29 February 2020. The information relating to these electoral polls was downloaded from the institutional website of the Presidency of the Council of Ministers: www.sondaggipoliticoelettorali.it .
In the period indicated above we collected 825 polls focused on voting intentions for the next political elections. As mentioned, the results of these polls have been disseminated by the mass media and are governed by rules that require the drafting of an information note that presents methodological information useful for assessing the correctness of the polls carried out by the various agencies (Di Franco 2018 ).
The Italian regulation on the publication and dissemination of electoral polls in the mass media lists the information that must compulsorily be inserted in the document that is published on the institutional website. These are the fifteen information items:
1. Subject who carried out the poll;
3. Date or period in which the poll was carried out;
6. Name of the mass media in which the poll is published or disseminated;
7. Date of publication or diffusion;
8. Topics covered by the poll;
9. Territorial extension of the poll;
11. Representativeness of the sample including indication of sampling error;
13. Sample size, number and percentage of non-respondents and replacement made;
15. Full text of all questions and percentage of people who answered each.
From our analysis it emerged that in many documents there are important gaps with respect to what is required by current legislation, especially in relation to purely methodological information.
To assess the quality of the documents as a whole, we have developed a completeness index of the poll information, adding the presence of the following six elements on which we have identified the most critical issues:
1. The proportions between the breakdown of interviews conducted with mixed interview methods;
2. The confidence interval for the estimates;
3. The number of subjects contacted;
4. The number of refusals and replacements for the interviews carried out;
5.
On average, the polls analyzed were carried out in just over two and a half days (mean 2.6 days; standard deviation 1.966; minimum 1; maximum 25).
The sample sizes vary in a range from 500 to 16,000 cases; the average is 1243.62, the standard deviation is 779.537.
Linked to the size of the sample is the level of sampling error that in the analyzed polls varies between 1.3 and 4.4%. The average error is 3%.
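The link between sample size and sampling error can be illustrated with the textbook margin of error for a proportion. This is a sketch under assumptions the source does not state: simple random sampling, a 95% confidence level (z = 1.96) and the most conservative proportion p = 0.5. The polling agencies may have used other formulas, and the function name is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion (simple random sampling)."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Margin of error, in percentage points, for a few illustrative sample sizes.
for n in (500, 1000, 2000):
    print(n, round(100 * margin_of_error(n), 1))
```

Under these assumptions the smallest sample in the data set (500 cases) gives a margin of about 4.4 percentage points, which agrees with the upper bound of the range reported above.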
Finally, the regulation requires information on the number and percentage of subjects who do not answer the poll questions. Since we considered only the question on voting intentions, formulated as “if you voted today [or, if you had voted yesterday] for the Chamber of Deputies, which party would you vote for [or, would you have voted]?”, we took into consideration whether the percentages of undecided respondents and of those who intend to abstain from voting were reported.
In 28.36% (234 cases) of the polls, neither the percentage of undecideds nor that of abstainers was reported.
Results and discussion
The first example consists of a comparison between a multiple linear regression model and an ANN Multilayer Perceptron.
We first present the results of multiple linear regression. The dependent variable is the percentage of voters who declared their intention to abstain or who declared their indecision regarding the election choice (label ‘no-vot’). The independent variables are the following four: the duration of the poll in days (label ‘days’); the sample size (label ‘n-sample’); the completeness index of the information relating to the poll (label ‘ind-1’); the ratio between the interview attempts and the interviews carried out (label ‘ind-2’).
Table 2 presents the fitting results of the multiple regression model. Considering the adjusted R square, we find that the four independent variables reproduce a little less than a third (31.1%) of the variance of the dependent variable. Table 3 shows the regression coefficients and Table 4 the residual statistics.
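For readers who want to see how the adjusted R square is obtained, the following sketch fits a multiple linear regression with four predictors by ordinary least squares. The data are synthetic stand-ins generated at random, not the poll data, and `fit_ols` is a hypothetical helper; the authors' own estimates come from their statistical package and are not reproduced here.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns coefficients, R^2, adjusted R^2."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])          # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)              # unexplained sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())        # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of predictors k.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return coef, r2, adj_r2

# Synthetic data: 825 rows and 4 predictors, mirroring only the shape of the poll matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(825, 4))
y = 20.0 + X @ np.array([1.0, -0.5, 2.0, 0.0]) + rng.normal(scale=3.0, size=825)

coef, r2, adj_r2 = fit_ols(X, y)
print(coef.round(2), round(r2, 3), round(adj_r2, 3))
```

The adjusted value is never larger than the unadjusted one, which is why it is the more cautious figure to report when several predictors are used.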
Table 2 Multiple regression model summary
Also in the third example the ANN results are better than those of the multinomial logistic regression model. On the whole, if we consider the results of the training, ANN reaches 94.5% of correct classifications against 92.3% of the multinomial model. Even if we take into consideration the results of ANN testing, they are, albeit slightly, better (93.2%). As for the single categories of the dependent variable, ANN achieves the best performance with the panel category (100% of correct classifications for both training and testing set) and with the CATI-CAMI-CAWI category (96.9% for the training set and 94.1% for the testing set).
For the other three categories of the dependent variable (CATI, CATI-CAMI and CATI-CAWI) the percentage of correct classifications varies from 86.5 to 91.2% for the training set and from 82.6 to 93.1% for the testing set.
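The percentages of correct classifications quoted above are the kind of figure that can be read off a confusion matrix. The sketch below shows the computation on hypothetical counts for three categories; the counts and the function name are invented for illustration and are not the authors' results.

```python
import numpy as np

def classification_rates(confusion):
    """Rows are observed categories, columns are predicted categories."""
    confusion = np.asarray(confusion, dtype=float)
    # Correct classifications lie on the diagonal.
    per_category = 100.0 * np.diag(confusion) / confusion.sum(axis=1)
    overall = 100.0 * np.trace(confusion) / confusion.sum()
    return per_category, overall

# Hypothetical counts for three categories of a dependent variable.
confusion = [[45, 5, 0],
             [4, 36, 0],
             [0, 0, 10]]

per_category, overall = classification_rates(confusion)
print(per_category, overall)
```

Computing the same rates separately on the training set and on the testing set, as the authors do, shows whether the network generalizes or merely memorizes the training cases.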
Conclusions
At the end of this excursus on feedforward ANNs we can summarize the most important aspects by highlighting their strengths and weaknesses.
As the phenomenon of generalization demonstrates, ANNs are capable of learning, namely, they allow solving problems by associating the sought solution with data. Indeed, network learning techniques are applications of known statistical methods (stochastic approximation) to a new class of nonlinear regression models. In this sense the determination of the network weights can be interpreted as a nonlinear regression applied to an ANN function. The advantage is to have an extremely flexible function, avoiding the subjective components of the specification error, as the parameters implicitly determine which is the latent function that a network approximates.
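The reading of weight determination as nonlinear regression can be sketched in a few lines. The example below fits a network with one hidden layer to a noisy nonlinear target by mini-batch stochastic gradient descent. It is an illustration only: the target function, the tanh activation, the hyperparameters and all names are assumptions, and it is not the SPSS Multilayer Perceptron procedure the authors used.

```python
import numpy as np

def train_mlp(X, y, n_hidden=8, lr=0.05, epochs=300, batch=32, seed=0):
    """Fit a one-hidden-layer network by mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle the cases every epoch
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            xb, yb = X[idx], y[idx]
            h = np.tanh(xb @ W1 + b1)              # hidden activations
            err = (h @ W2 + b2) - yb               # gradient of half squared error w.r.t. output
            grad_W2 = h.T @ err / len(idx)
            grad_b2 = err.mean(axis=0)
            back = (err @ W2.T) * (1.0 - h ** 2)   # error propagated through tanh
            grad_W1 = xb.T @ back / len(idx)
            grad_b1 = back.mean(axis=0)
            W1 -= lr * grad_W1
            b1 -= lr * grad_b1
            W2 -= lr * grad_W2
            b2 -= lr * grad_b2
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    """Network output: a linear unit on top of tanh hidden units."""
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Synthetic nonlinear relationship standing in for an unknown latent function.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X) + rng.normal(scale=0.1, size=(200, 1))

params = train_mlp(X, y)
mse = float(np.mean((predict(X, *params) - y) ** 2))
print(round(mse, 4))
```

No functional form is specified in advance here: the weights alone determine which curve the network ends up approximating, which is the flexibility the paragraph above describes.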
If the analytical form of the function underlying the problem under study is known, or can be assimilated to a known form, the problem of parameter estimation refers to the case of nonlinear least squares and the use of ANNs is not justified; it becomes so when one is not able to formulate reliable conjectures on such form. In this case, the use of networks is easier and more productive than other complex procedures with restrictive assumptions. The use of ANNs is therefore effective as a criterion for identifying hidden nonlinear relationships.
The ability to learn is related to the ability to forecast. ANNs offer good performance both in univariate forecasting, that is, when one wants to predict the behaviour of a variable of a system that evolves over time on the basis of its past trend, and in multivariate analysis, when trying to predict the trend of a variable by observing the past behaviour of several variables of the evolving system. Many studies have highlighted how ANNs allow good approximations and extrapolations to be made. Since a forecasting problem can be reduced to an approximation and extrapolation problem, it is possible to use networks to approximate the regularities present in the variations over time of the variable to be predicted. ANNs flexibly adapt to complex situations that change over time: directly if learning is unsupervised, by re-training if learning is supervised. They are also suitable for processing data that are incomplete or affected by noise or biases. By virtue of this ability to adapt to data, ANNs are very robust, that is, they have a high resistance to failures and malfunctions. Another important feature is the computational speed that derives from their parallelism and the very rapid input–output association, since the computations to be performed are weighted sums and threshold selections; therefore, they constitute a valid alternative to traditional techniques for performing complex computations.
Obviously, ANNs are not magical boxes. As we have made clear, ANNs can achieve better performance than linear methods if there are nonlinearities and interactions in input data. It should be kept in mind that just because we have data, it does not mean that there are underlying rules that can be learned. ANNs offer an approach to analysis that is data-intensive and exploratory. The focus of these methods is on computational efficiency, not modelling. Of course, the results will not necessarily be good unless the variables are. As the old adage of computer science goes: “garbage in, garbage out”.
The critical points of ANNs are, first of all, the long and scarcely incremental learning; in addition to requiring a large number of epochs before significantly reducing the error, learning must be repeated when the situation represented by the patterns undergoes substantial changes, unless such learning is continuous or unsupervised.
Obviously also for ANNs, as in any other case, it is necessary to have a data set that is rich and representative (of the problem under study) so that the training set and the testing set are effectively controllable.
Other problems may arise from the low accuracy and uncertain reliability of the results provided by ANNs: the past performance of a network does not guarantee its future performance. There is a risk that the generalization is not complete and that therefore most of the inputs do not recall correct outputs. Furthermore, there are no strict criteria to design the most suitable network for a given problem; it is necessary to proceed by trial and error with, as mentioned, numerous degrees of freedom in the choice of each parameter. Moreover, each network has its own specificity. If the same experiment is repeated on another network, the results will not be identical, although in most cases they tend to converge. This is another interesting feature of ANNs: they are able to provide similar results in terms of performance with a variety of weight settings. Clearly what is important is not the value of a certain weight, but the overall set of all connection weights.
Finally, the criticism most frequently raised against the usefulness of ANNs is that, even when they succeed in the assigned task, they do not allow us to explain their operation on a cognitive level (in the case of sociological research we could say on the level of the analysis of relationships between variables). We expect from a model not only that it will be able to predict or reproduce its referent, but also that it will be transparent, that is, that it will make us understand how it works and what mechanisms, processes and principles are behind it. ANNs, according to this criticism, risk attaining the first goal but not the second. A network that was able to learn a certain task and is also capable of extending its performance to new situations, showing in this way that it has incorporated the mechanisms and principles underlying that task, may nevertheless not be very transparent as to these mechanisms and principles: it does not make them emerge clearly and thus does not allow a full explanation of the phenomenon in question. Their strictly quantitative nature, the interweaving of the links, the connection weights and the effects of a local phenomenon of activation on the rest of the network are all factors that make the behaviour of networks opaque when they are used as tools for explaining the relationships between variables.
Notes
1. We must not forget that the drivers and aspirations of corporations are quite different from the aims of academic researchers: the former are motivated by financial concerns (e.g. prediction and control for product improvement and to identify new markets and opportunities), whereas the latter are focused, or at least should be, on the search for understanding and explanation of phenomena and processes (Crawford 2013; Kitchin 2014a; Lagoze 2014; Törnberg and Törnberg 2018).
2. Neurons are typically arranged along horizontal lines called layers; they communicate with neurons in the lower and upper layers by transforming the signals from layer to layer non-linearly. Their weights are iteratively modified by ML algorithms, one of the best known and most useful being stochastic gradient descent.
3. For the ANN applications we used the Multilayer Perceptron procedure available in SPSS for Windows.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article
Di Franco, G., Santurro, M. Machine learning, artificial neural networks and social research. Qual Quant (2020). https://doi.org/10.1007/s11135-020-01037-y
