Logo

The Data Daily

Intuitive physics learning in a deep-learning model inspired by developmental psychology - Nature Human Behaviour

Intuitive physics learning in a deep-learning model inspired by developmental psychology - Nature Human Behaviour

Intuitive physics learning in a deep-learning model inspired by developmental psychology
Psychology
Abstract
‘Intuitive physics’ enables our pragmatic engagement with the physical world and forms a key component of ‘common sense’ aspects of thought. Current artificial intelligence systems pale in their understanding of intuitive physics, in comparison to even very young children. Here we address this gap between humans and machines by drawing on the field of developmental psychology. First, we introduce and open-source a machine-learning dataset designed to evaluate conceptual understanding of intuitive physics, adopting the violation-of-expectation (VoE) paradigm from developmental psychology. Second, we build a deep-learning system that learns intuitive physics directly from visual data, inspired by studies of visual cognition in children. We demonstrate that our model can learn a diverse set of physical concepts, which depends critically on object-level representations, consistent with findings from developmental psychology. We consider the implications of these results both for AI and for research on human cognition.
Main
The field of artificial intelligence (AI) has made astonishing progress in recent years, mastering an increasing range of tasks that now include Atari video games 1 , board games such as chess and Go 2 , scientific problems including protein-folding 3 and language modelling 4 . At the same time, success in these narrow domains has made it increasingly clear that something fundamental is still missing. In particular, state-of-the-art AI systems still struggle to capture the ‘common sense’ knowledge that guides prediction, inference and action in everyday human scenarios 5 , 6 . In the present work, we focus on one particular domain of common-sense knowledge: intuitive physics, the network of concepts that underlies reasoning about the properties and interactions of macroscopic objects 7 . Intuitive physics is fundamental to embodied intelligence, most obviously because it is essential to all practical action, but also because it provides one foundation for conceptual knowledge and compositional representation in general 8 . Despite considerable effort, however, recent advances in AI have yet to yield a system that displays an understanding of intuitive physics comparable to that of even very young children.
To pursue richer common sense physical intuition in AI systems, we take, at multiple points in our work, inspiration from developmental psychology, where the acquisition of intuitive physics knowledge has been an intensive focus of study 9 , 10 , 11 , 12 . We build a deep-learning system that integrates a central insight of the developmental literature, which is that physics is understood at the level of discrete objects and their interactions. We also draw on developmental psychology in a second way, which relates to the problem of behaviourally probing whether an AI system (or in the case of developmental psychology, an infant or child) possesses knowledge of intuitive physics.
In developing behavioural probes for research on children, developmental psychologists have based their approach on two principles. First, that the core of intuitive physics rests upon a set of discrete concepts 11 , 13 (for example, object permanence, object solidity, continuity and so on) that can be differentiated, operationalized and individually probed. By specifically targeting discrete concepts, our work is quite different from standard approaches in AI for learning intuitive physics, which measure progress via video or state prediction 14 , 15 , 16 metrics, binary outcome prediction 17 , question-answering performance 18 , 19 or high reward in reinforcement learning tasks 20 . These alternative approaches intuitively seem to require an understanding of some aspects of intuitive physics, but do not clearly operationalize or strategically probe an explicit set of such concepts.
The second principle used by developmental psychologists for probing physical concepts is that possession of a physical concept corresponds to forming a set of expectations about how the future can unfold. If human viewers have the concept of object permanence 21 , then they will expect that objects will not ‘wink out of existence’ when they are out of sight. If they expect that objects will not interpenetrate one another, then they have the concept of solidity 22 . If they expect that objects will not magically teleport from one place to another but instead trace continuous paths through time and space, then they have the concept of continuity 11 . With this conceptual scaffolding, a method for measuring knowledge of a specific physical concept emerges: the violation-of-expectation (VoE) paradigm 21 .
Using the VoE paradigm to probe for a specific concept, researchers show infants visually similar arrays (called probes) that are either consistent (physically possible) or inconsistent (physically impossible) with that physical concept. If infants are more surprised by the impossible array, this provides evidence that their expectations, derived from their knowledge of the probed physical concept, were violated. In this paradigm, surprise is putatively measured via gaze duration, but see refs. 23 , 24 , 25 for further discussion. As an example, consider the concept of continuity (depicted in Fig. 1 ): objects trace a continuous path through time and space. For the possible probe (Fig. 1 , first row), researchers 26 showed an object moving horizontally behind a pillar, being occluded by that pillar, subsequently emerging from occlusion, and travelling towards a second pillar, where it was again occluded behind that pillar and emerged from occlusion one final time. In the impossible probe (Fig. 1 , third row), when the object is occluded by the first pillar, it does not emerge from occlusion immediately. Instead, after some delay, the object emerges from behind the second pillar - never appearing in the space between the two pillars and thus seeming to teleport from one pillar to the other. Experiments with infants have shown that by the age of 2.5 months, they gaze longer at an object that teleports between two screens than an object that moves continuously from one screen to the next 26 . This same strategy has been used by developmental researchers to accumulate strong evidence that infants acquire a wide range of distinct physical concepts 9 , 10 , 11 , 12 within the first year of life.
Fig. 1: Probes adapted from developmental psychology to assess the physical concept of continuity.
Example probes adapted from ref. 26 to assess the physical concept of continuity 11 , 69 : objects trace a continuous path through time and space. Each row corresponds to one temporally downsampled video in a probe tuple. Checkered backgrounds were used as a cue for depth and to introduce visual diversity of our stimuli. Actual videos consist of a total of 15 frames. The top two rows are physically possible probes and the bottom two rows are physically impossible probes. Physically possible probes: in the first physically possible probe (first row), a ball rolls behind two occluders. In the second possible probe (second row), no ball is present. Physically impossible probes: these probes are formed by splicing parts of the physically possible probes into impossible events. In the first impossible probe (third row), the ball rolls behind the first occluder and emerges from the second occluder, never appearing between the two occluders. The second impossible probe (fourth row) has the opposite structure: the ball appears between the two occluders, but was never seen rolling behind the first occluder or rolling out of the second occluder.
Full size image
Discussion
‘Common sense’ reasoning, in numerous varieties, has been an increasingly pressing objective for AI research. In the present work, we reported progress toward building deep-learning systems that construct, on the basis of perceptual experience, knowledge in a specific but fundamental common-sense domain: intuitive physics. In doing so, we drew inspiration from developmental psychology on two fronts. First, to measure progress systematically and quantitatively, we ported the VoE paradigm from developmental psychology to probe five different physical concepts. Second, to build a model capable of learning intuitive physics, we endowed our model with object-centric representation and computation directly inspired by accounts of infant intuitive physics. To train that model, we created a training dataset of composable physical interactions yielding complex three-dimensional (3D) scenes, the Physical Concepts dataset (also making this freely available online as a resource for other researchers). To test the utility of the object-based approach, we compared our model with tightly controlled, object-agnostic baselines. Finally, to test the generalization capabilities of our network, we evaluated our model on an externally defined dataset with novel appearances, shapes and dynamics without any retraining.
Our results centre on four key observations. First, our object-based model displayed robust VoE effects across all five concepts we studied, despite having been trained on video data in which the specific probe events did not occur. Second, consistent with expectations founded on the developmental literature, we found that the VoE effects seen in our model diminished or disappeared in well-matched models that did not employ object-centred representations. Third, we observed that robust VoE effects could be obtained with as little as 28 h of visual training data. Finally, we report that our model generalized to an independently developed dataset involving novel object shapes and dynamics.
Our work relates closely to, and also substantially extends, a number of previous studies in AI and computational cognitive science. Two other groups have built intuitive physics evaluation frameworks specifically using the VoE paradigm. Riochet et al. 28 , in research conducted in parallel with our own original work in this area 27 , presented a dataset for probing a variety of physical concepts, including a subset of those covered in our Physical Concepts dataset. Although the videos comprising this dataset include richer textures and lighting effects than our dataset, the physical events themselves are more restricted in their diversity. Smith and colleagues 29 introduced the ADEPT VoE dataset mentioned earlier, which is still simpler in design and eschews collision events. Neither of these preceding studies reported a system that successfully acquires specific physical concepts through learning: the model presented by Riochet and colleagues showed VoE effects only in a minority of conceptual categories tested, and the model presented by Smith and colleagues embeds a hand-engineered physics engine rather than learning physics from scratch.
Our work extends these previous work by introducing a new dataset involving a more dramatic separation between training and test events, combined with object dynamics more faithful to physics. Of course, despite these advances, the range of object and event types in our dataset remains narrow compared with those encountered in the real world. Increasing richness and ecological validity while also maintaining experimental control is a challenge in AI just as it is in psychology, but growing research of the present kind towards richer data domains is of course an important aspiration. Fortunately, the architecture we have introduced is neither tied to the particular object types and events contained in our dataset, as demonstrated in our ADEPT experiment, nor is there an a priori reason to expect that a similar computational approach cannot be extended to more naturalistic event types. Indeed, relational architectures related to PLATO have recently been employed in engineering research to learn the dynamics of quite complex physical systems 38 .
In presenting our approach, we have stressed the centrality of learning. In view of this, it is important to acknowledge that our model implementation was granted access to two sources of privileged information: object (segmentation) masks and object indices allowing consistent placement of object embeddings within the model over timesteps (tracking). Recent research has introduced methods for object segmentation and tracking that would allow each of these to be extracted from pure video data without access to privileged information of any kind. In the case of object masks, several recent papers have introduced methods for unsupervised object discovery (see refs. 39 , 40 for example). Recently, methods for learning to track objects consistently over time have been reported 41 , 42 . Integrating these methods into our framework to allow true ‘learning from scratch’ represents an appealing goal for next-step development. In the meantime, it is worth noting that the mere availability of object-level segmentation and tracking information does not fully explain the simulation results we have presented. In our training-set size analysis (Fig. 6) , the model without any training at all (‘0’) failed to produce any VoE effects. In a related simulation work also focusing on VoE effect in intuitive physics, a model that was given segmentation masks but which did not carry out relational object-level processing failed to produce VoE effects 28 .
The advantages of object-level representation observed in the present work echo related findings from other areas of AI research, including question-answering 43 and reinforcement learning 20 , 44 , 45 , 46 , 47 , where object-level representation has been found to support faster learning and improve transfer. What is the explanation for the benefit of object-level representation in the problem setting we have studied? We speculate that the advantage derives from the role that object-level representation can play as a regularizer. In a machine-learning context, regularization refers to the inclusion of constraints on a learning process, which bias that process to place greater weight on particular inferences from training data 48 . Regularization, in this sense, is employed to prevent a learning system from ‘overfitting’ to training data, learning representations that are so closely tied to the specifics of that data that the system fails to identify more general regularities that would allow it to perform correctly on new data at test. A bias towards object-level representation could have this kind of regularizing impact on visual learning, in the sense that it biases the learning system towards a compositional representation of visual events which corresponds to the actual compositional structure of the physical events themselves, resulting in knowledge that generalizes well to new events. While this explanation is speculative, it aligns with one informal observation from preliminary modelling work not described above. In this preliminary modelling, we trained object-based and flat models on videos in which the depicted events were much closer in form to those presented at test in the VoE probes. Under these conditions, the difference in VoE effects between object-based and flat models was considerably smaller than what we observed in the simulations that we report in the present paper, which featured much greater separation between training and testing data. It is in this setting, where a learning system must transfer at test to data that lies reasonably far from any particular training example, that regularization can be decisive. Not only does this regularization have the potential to make an intractable task tractable, but it may also make models more data-efficient. Where deep-learning models have often been criticized as being too data inefficient 5 , our results demonstrate that a relatively small amount of visual experience—on the order of tens of hours—is sufficient to engender robust VoE effects in response to violations of physics. Exploring this regularization-based interpretation of the role of object-based representation thus stands as an appealing hypothesis for further investigation. In this connection, it is worth noting that recent work in AI has increasingly favoured computational architectures (for example, graph nets and tranformers) that implement an inductive bias towards relational, compositional processing 49 .
Throughout the present work, we have emphasized the role of insights from developmental psychology in guiding the research we have reported. What, if any, are the implications of our work for developmental psychology itself? This topic must be approached with some care, since the model that we have presented is not intended to provide a direct model of physical-concept acquisition in children. Nevertheless, we do think there are several insights that may be germane to developmental science. First, our modelling work provides a proof-of-concept demonstration that at least some central concepts in intuitive physics can be acquired through visual learning. Although research in some precocial species suggests that certain basic physical concepts can be present from birth 50 , in humans the data suggest that intuitive physics knowledge emerges early in life 31 but can be impacted by visual experience 51 , 52 . Of course there is extensive debate and legitimate uncertainty about innateness 11 , 53 . Our modelling work complements these conclusions from experimental research by demonstrating the sufficiency of visual learning to explain the emergence of one core set of physical concepts, assuming—as developmental psychology led us to anticipate—that object-level representations are available. Furthermore, our model reflects just one point in a family of possible models that implement proposed algorithmic-level 54 principles (for example, tracked objects) suggested by developmental psychologists. Exploring different ways of implementing these principles (for example, replacing our object buffer with a recurrent object tracker or removing recurrence from the dynamics predictor; see Supplementary Fig. 12) presents a natural avenue for future work to make more precise contact with developmental accounts at the implementation level 54 .
It would be fascinating to extend the present modelling work to make even more direct contact with key questions in developmental psychology. For example, the order of concept acquisition throughout development has been of central interest to developmental psychology 55 . Although we do not model developmental trajectories in the present work, there is a long history of using connectionist models to shed light on stage-like developmental processes, including in the domain of intuitive physics 56 . Recent work 57 is extending the VoE effect to neurophysiological measurements, potentially opening up new possibilities for testing and constraining models of the kind we have proposed here. Bringing the present work to bear on these questions poses an obvious research opportunity. This would be especially promising in combination with newly emerging datasets that capture first-person footage of infant experience at large scale 58 , 59 . This represents just one opportunity that our model might offer as a tool for computational research into the origins of intuitive physics in human development. A broader range of possibilities follows from the wider class of relational AI systems mentioned briefly above 49 . Whereas the present work has explored the utility of relational, object-based processing for understanding intuitive physics, we speculate that relational mechanisms currently being explored in AI may be useful for understanding human knowledge in other ‘core domains’ studied in developmental psychology 12 , perhaps most interestingly the interactions with animate agents.
Methods
Training dataset, ‘freeform’ physical events
Our training dataset consists of 300,000 procedurally generated scenes using custom Python code and the Mujoco physics engine. Our motivation was to build a dataset for training that encompassed a wide range of complex physical interactions. Thus, we built composable scene building blocks: rolling, collisions along the ground plane, collisions from throwing or dropping an object, occlusions (via a ‘curtain’ that descends from the top of the screen and retracts), object stacks, covering interactions (an open-bottom, closed-top container falls onto an object), containment events (an object falls into an open-top container) and rolling up/down ramps. Each building block used randomly generated values for object appearances and locations, while ensuring the main intended physical interaction still occurred.
To form a scene, we chose two to four scene building blocks. Building blocks compose by advertising and/or consuming locations of interest. For example, ‘rolling’ building blocks have two locations of interest: the rolling object’s starting point and the point it rolls towards. An ‘object stack’ building block has a single location of interest: the centre of the object stack. Composing these two building blocks can yield a scene where a sphere rolls at the object stack. Alternatively, the sphere can start off at the top of the object stack and roll off of it. With some probability, building blocks do not compose and inhabit the same scene independently.
We restricted the primitive shapes in our dataset to rectangular prisms and spheres. From the rectangular prisms we built a ‘curtain,’ a ramp, an arch, and both open-top and closed-top containers. Object sizes were chosen in a hand-crafted range to ensure that objects were visible. Object colours and the checkerboard floor were chosen as random red-green-blue (RGB) values. Objects had varying but stereotyped masses. Rolled objects and objects in an object stack had a mass of 10. Dropped or thrown objects were made four times heavier so that they could more easily displace objects they hit. Containers had a mass of 4 or 5. Arches had a total mass of 60 to keep them upright. We did not withhold any object shapes from the training data for use in the test probes. Learning to generalize the dynamics of arbitrary shapes would be its own research programme with significant challenges. We could sidestep this issue by creating probes that employ novel shapes but endow them with familiar dynamics: a chess piece that glides across a frictionless floor would have the same lateral trajectory as one of our rolling balls. However, this would be a contrived version of physics that could only provide a superficial test of generalization to new objects. Each scene unfolded over 3,000 simulation steps and was rendered into 15 frames at a resolution of 64 × 64 RGB pixels. Each frame had a corresponding segmentation mask rendered at the same resolution where each object in the scene, including the floor, was given a unique and consistent value. To add extra complexity, we used a drifting camera. The position and orientation of the camera was the same at the start of all scenes. From there the camera randomly drifted its position and location (subject to a bounding box to ensure the camera was still looking at relevant parts of the scene). For each of the 15 frames in a video, we exported the camera position and orientation and gave this information to our model. We chose a checkered background to provide a depth cue for the relative positions of objects as they and the camera position moved throughout a video. We added this for visual diversity and to make our results comparable to that of ref. 27 . Finally, we built an additional 5,000 ‘freeform’ scenes with this same structure to use as a test set during hyperparameter selection.
VoE paradigm
As described earlier, to assess the model’s knowledge of specific physical concepts, we leverage the VoE paradigm from the developmental literature and adapted for artificial models 27 , 28 in parallel and subsequently 29 . We generated 5,000 probe scenarios for each of the following physical concepts adapted from developmental psychology: object persistence, continuity, unchangeableness, directional inertia and solidity (described below). Whereas the training dataset had a freely moving camera, we used a stationary camera for the probes to ensure objects were (where applicable) occluded during the transition from physical to aphysical frames. We computed surprise as the model’s prediction errors over the course of a video. Although we optimized the loss in the object code space, for evaluation we looked at the error in pixel space. We can take this error in pixel space by decoding object codes to images using the ComponentVAE’s decoder, Θ.
$$surprise(video)=\mathop{\sum }\limits_{t=1}^{T-1}\mathop{\sum }\limits_{k=1}^{K}{({{\Theta }}({z}_{k}^{t+1})-{{\Theta }}({\hat{z}}_{k}^{t+1}))}^{2}.$$
we took this loss in pixel space, instead of in object code space which we used for training, due to the high variance of object codes representing empty ‘objects’. This arises because the ComponentVAE emits a fixed number (K = 8) of output components (‘objects’). If there are fewer than 8 objects in the image, then the ComponentVAE outputs ‘objects’ which correspond to blank pixels. As these empty objects were fairly prevalent in our scene (most videos did not contain the maximum of 8 objects in a scene), the learned representational code uses the unit Gaussian (mean = 0, variance = 1) to represent empty components. To represent components with actual objects in them, the ComponentVAE shifts the mean from 0 and decreases the variance (Supplementary Fig. 18) . Because of the high variance on empty objects, it is relatively difficult to predict the exact value of an empty slot across timesteps. By taking pixel loss, we avoided this issue as any of the values that encode an empty object all decode to a blank image. Furthermore, because the possible and impossible videos within a probe scenario contained the same objects, we avoided issues that typically arise when using pixel loss. For example, in pixel space, predictions of large objects carry more weight than predictions of small objects, but within a probe scenario the objects in the possible and impossible videos are perfectly matched.
Physical concept ‘object persistence’
Perhaps the most fundamental aspect of intuitive physics is understanding that objects cannot disappear from existence. This is referred to as ‘object permanence’. The principle of ‘object persistence’ 10 extends this to say that “objects persist, as they are, in time and space”. Taking inspiration from a classic behavioural experiment 13 on object permanence, probes for this category involved a rigid plank falling on an object. In the possible probe (Supplementary Fig. 1) , when the plank fell on the object, the plank occluded it while also remaining propped up by it as expected. By contrast, in the impossible probe the plank fell on top of the object (in a manner that is initially identical) but ended up flat on the floor, as if the object had disappeared. In the counterbalanced probes, the possible probe had the plank falling flat on the floor in an otherwise empty scene, and the impossible probe had the plank falling in the same empty scene but ended up inexplicably propped up by an item that was made to appear under the plank while it occluded part of the floor. We flagged this as a test of object persistence because not only must the object continue to exist while being occluded, but it must also retain its properties. For example, if the object shrank while occluded, then the plank could occupy the space it occupied without violating physics.
Physical concept ‘unchangeableness’
By the principle of ‘unchangeableness’ 9 , objects tend to retain their features (for example, colour, shape) over time. In the possible probes of this dataset (Supplementary Fig. 3) , a random assortment of static objects were aligned in the foreground. A screen was lowered in front of those objects, and was then raised.
The concept of ‘unchangeableness’ relates to a number of different aspects of objects, and therefore in the impossible probes we swapped the positions of objects when they were behind the curtain to suggest that their position, colour or shape had changed.
The developmental study that inspired our event design 30 used separate occluders for each object. To simplify procedural event generation, our probe videos involved a single occluder spanning all objects. It should be noted that, in contrast to the developmental study, the ‘impossible’ events in our videos were therefore susceptible to an alternative interpretation, whereby objects traded places through self-propelled motion. In the developmental literature, this would be classified not as a violation of unchangeableness, but instead as a detection of (implausible) self-propelled motion in animate objects 60 .
Physical concept ‘continuity’
The concept that an object traces out one continuous path through space and time is referred to in the developmental literature as ‘continuity’ 31 . For videos (Fig. 1) in this category, we used a nearly identical setup to a classic experiment 26 where possible probes began with two static pillars separated by a gap. A ball was rolled horizontally behind both pillars so that it was visible before, between and after the pillars during its trajectory. In the impossible probes, while the ball was occluded by the first pillar, the ball was made invisible for the period when it would be between the pillars, and then reappeared after the second pillar. Because the size of the pillars matched the size of the ball, it was not possible that the ball was simply stopped behind the first pillar. Alternatively, we made the ball only visible when it was rolling between the pillars, but not before or after. The only way this could be possible would be if there were self-propelled balls, which never appeared in our dataset. Note that the splicing procedure for probes for this concept deviated slightly from what is shown in Fig. 4 . However, we still maintained the image-level and pairwise matching of frames across possible and impossible videos in a probe tuple.
Physical concept ‘solidity’
This dataset is an adaptation of an experiment that uses an object and an occluder 22 to test understanding of the solidity of objects, as related to the penetration of an object through a container and the ground below. Although originally situated in an analysis of the differences between infant cognition in occlusion versus containment events, it nonetheless relies on the concept of solidity. In probes, perspective was carefully controlled such that the camera can view inside the top of the container but not the bottom. In possible probes (Supplementary Fig. 2) , a rectangular block was dropped into the container and came to rest as expected. In the impossible probes, the object ‘fell through’ the container and the floor, and therefore disappeared from view (with the penetration itself occluded by the face of the container) even if the object should have remained visible due to its height. To allow for the bias-free splicing procedure, we had an alternate impossible probe condition which related to solidity in a different way: the object remained visible at the top of the container when its height clearly dictated that it should have fallen further into the container. In this case, the violation is that objects can only rest upon solid surfaces. In terms of developmental theory, this relates closely to what have been termed the inertia and gravity principles 61 .
Physical concept ‘directional inertia’
This dataset is a loose adaptation of a paradigm developed by Spelke et al. 62 for investigating infant knowledge of inertia. Where the classic paradigm investigated both magnitude and directional violations of the principle of inertia under occlusion, the current probe only tests the directional component in an unoccluded fashion. The altered version was chosen to allow for the bias-free splicing procedure. Furthermore, we wanted to include a test involving collisions (which is a common, albeit in 2D not 3D, domain for learning physical dynamics with deep learning; for example, ref. 63 ). Possible probes (Supplementary Fig. 4) were formed by rolling a ball at an angle towards a heavy block. Upon contact, the sphere rolled away from the block while reflecting its velocity about the angle of incidence (as expected). The paired possible probe reversed this trajectory: it began at the end point of the first possible probe, rolled at the same location on the block and bounced off the block to end up at the initial position of the first probe. Each impossible probe was formed by swapping the trajectories of the impossible probes at the point where they made contact with the block. The effect of this swap was that when the ball hit the block, instead of reflecting about the angle of incidence, it headed back towards its initial location, clearly violating the principles of directional inertia for colliding objects.
Model architecture
To implement our object-centric approach, our model used two main components: a feedforward perceptual module and a recurrent dynamics predictor.
The perceptual module took as input a 64 × 64 RGB image and 64 × 64 segmentation mask consisting of n = 8 channels and produced an object code for each (potentially empty) channel in the segmentation mask through unsupervised representation learning. Although not provided as inputs or targets to the perceptual module, we found that some of the axes of the code represented interpretable, static properties of the objects such as position on the ground plane or above the ground, height, width and different types of shape transformations (see Figs. 19 – 23 in Supplementary Material for traversals of this learned latent space). These object codes were fed into the dynamics predictor, which was then tasked with predicting the object codes at the next timestep. Instead of directly predicting the next image in the video, the dynamics model predicted per-object percepts.
Perception module
To learn these object codes, we used ComponentVAE 39 . We used a convolutional neural network for the encoder Φ and a spatial broadcast decoder θ 64 . Where that work learned segmentation masks using an attention network, we used ground truth segmentation masks from the dataset. For a given image x and a segmentation mask m with K = 8 channels (representing a maximum of 8 objects in any scene), the ComponentVAE yielded K 16-dimensional Gaussian posterior distributions qϕ(zk∣xk, mk) (where xk is the masked image). The sample from this code, zk, was the object code for the kth object. The decoder took as input zk to reconstruct masked object image and mask: \({p}_{\theta }({\hat{x}}_{k},{\hat{m}}_{k}| {z}_{k})\). Note that similar to ref. 39 , we treated the ground plane as its own object with its own set of parameters in the encoder and decoder. Additionally, we always output 8 codes that were used by the dynamics predictor. Where there were fewer than 8 objects in the image (some of the mask channels were empty), the ‘extra’ slots were tasked with reconstructing an empty object image and mask.
During an offline phase, before training the dynamics predictor, we optimized the variational objective as defined by ref. 39 to learn the encoder and decoder parameters. We set β to 0.1 and γ to 10 to ensure that the model reconstructed the object masks with high fidelity. We found that even at this relatively low value of β, the model learned disentangled representations 65 for each object, sometimes corresponding to factors such as position, shape, colour and size (see Figs. 19 – 23 in the Supplementary Information ).
We used the pretrained ComponentVAE in two ways during the prediction task. First, to actually obtain the object codes, we fed in a segmented image into the ComponentVAE encoder to get the per-object posterior distributions. The object code was then obtained by taking a sample from each posterior distribution. Although it might seem more natural to use the mean instead of a sample, we found better results using the sample on our training data. Thus, each object code was a 16-dimensional vector in a learned representational space. Second, we used the ComponentVAE’s decoder to map the object codes predicted by our model to per-object images (which were easily composed to form a composite image). This allowed us to visualize the model’s predictions.
Dynamics predictor
Where the perception module captured static object properties, our dynamics predictor needed to integrate information over time (for example, to calculate velocities or remember occluded objects). Thus, we employed a recurrent module for the dynamics predictor. We developed the ComponentLSTM, which is an objectwise LSTM (with 2,056 hidden units) with shared weights ψ, but object-specific activations. At each timestep, the LSTM for the kth object computes the following:
$${\hat{z}}_{k}^{t+1},\mathrm{cel{l}}_{k}^{t},\mathrm{hidde{n}}_{k}^{t}=\mathrm{LSTM}({z}_{k}^{1:t},\mathrm{cel{l}}_{k}^{t-1},\mathrm{hidde{n}}_{k}^{t-1};\psi ),$$
where the LSTM function is shorthand for the standard LSTM 66 updates and \({\hat{z}}_{k}^{t+1}\) is the model’s prediction for object code k in the next image. The \({z}_{k}^{1:t}\) here is the object buffer: it is the set of all previously seen object codes for the kth object. Although in principle we could feed in just the the most recent object codes, we found much better results on our training data when using the full history (see ‘Hyperparameter Selection’ in Supplementary Information for motivation on this and other model hyperparameter choices). Because the segmentation masks were consistently ordered throughout the course of a video, the perceptual module’s outputs were also in a consistent ordering. Thus, when we aggregated object codes into the object buffer and fed them into the corresponding slot in the LSTM to make a prediction for the corresponding object, we endowed our model with object tracking.
Thus far, we have not equipped the dynamics predictor with a mechanism for computing the influence of objects on each other. To do so, we leveraged an Interaction Network 63 . We computed pairwise interactions from the LSTM cell states to both the cell states (using multi-layer perception (MLPρ)) and the object buffer inputs (using MLPλ). Each MLP consisted of three layers with 512 units and the Gaussian Error Linear Unit activation function 67 . Briefly (see ‘Computing Interactions’ in Supplementary Information for details), for the kth memory slot in the LSTM, \(\mathrm{cel{l}}_{k}^{t-1}\), we compute:
$$\mathrm{in{t}}_{k}=\mathrm{InteractionNetwork}(\mathrm{from}=\mathrm{cel{l}}_{k}^{t-1},\mathrm{to}=[\mathrm{cel{l}}_{1:K}^{t-1};{z}_{1:K}^{1:t}]).$$
To aggregate the interactions across all pairs for the kth slot, we computed the sum and the maximum of the interactions and concatenated these values together. We added the corresponding interactions as an additional input to the kth LSTM, yielding the final update computation:
$${\hat{z}}_{k}^{t+1},\mathrm{cel{l}}_{k}^{t},\mathrm{hidde{n}}_{k}^{t}=\mathrm{LSTM}(\mathrm{concat}({z}_{k}^{1:t},\mathrm{in{t}}_{k}),\mathrm{cel{l}}_{k}^{t-1},\mathrm{hidde{n}}_{k}^{t-1};\psi ).$$
we call the combination of an InteractionNetwork and ComponentLSTM an ‘InteractionLSTM’. Additionally, we included a residual connection from the kth object code for the current timestep \({z}_{k}^{t}\) to the model’s prediction for the kth object (once again exploiting aligned/tracked objects).
Finally, recall that the training data included a randomly drifting camera. Thus, we fed the ‘InteractionLSTM’ with viewpoint information as additional inputs. First, the interaction network received information about the current camera position. Additionally, the ComponentLSTM received information about the camera position at the next timestep. In this way, the dynamics predictor was not only tasked with predicting dynamics, but also predicting dynamics from a specific viewpoint.
We trained the dynamics predictor as a next-step predictor on the sequence of T = 15 VAE-compressed images for each video using teacher-forcing. At each timestep t, we converted an input segmented image to a set of object codes \({z}_{1:K}^{t}\), using the perceptual module’s encoder. We formed the object buffer \({z}_{1:K}^{1:t}\) by concatenating these object codes for all previously seen images from the video. To form a fixed-length vector across the sequence, we padded the object histories with zeros where necessary. Before feeding the object buffer into any downstream computations, we projected them into a 1,680-dimensional space using the Exponential Linear Unit activation function 68 . Camera information and the object buffer were fed into the InteractionLSTM to yield object code predictions \({\hat{z}}_{1:K}^{t+1}\). We optimized the sum of the mean-squared error between our predictions and the observed object codes at the next timestep:
$${{{{\mathcal{L}}}}}_{\mathrm{objectcode}}=\mathop{\sum }\limits_{t = 1}^{T-1}\mathop{\sum }\limits_{k = 1}^{K}{({z}_{k}^{t+1}-{\hat{z}}_{k}^{t+1})}^{2}$$
Baseline models
Where PLATO used a slotted perceptual module—each slot encoded a distinct object—the flat baseline models encoded the entire scene in a single slot. This translated naturally into the dynamics predictor which made predictions based off of a single slot. The updates in the flat models were identical to the PLATO update equations above, with the exception that K = 1 instead of K = 8. Because there was only one slot, this meant that the flat module did not get any information in the segmentation mask (it had only a single channel and was uniformly filled to 1). However, see ref. 28 for evidence that non-object-based models trained with segmentation masks still fail intuitive physics examinations.
Experiments
Training
To train the perception module, we used the RMSProp optimizer, with a learning rate of 1 × 10−4 for 1,000,000 steps with a batch size of 64 images. To train the dynamics predictor, we trained our models for a total of 1,300,000 training steps with a batch size of 128 videos. We used a learning rate that transitioned from 1 × 10−4 to 4 × 10−5 after 300,000 training steps. The above training procedure was used for all models and all experiments.
PLATO vs flat models
For each model, we pretrained the perception module on the full dataset of 4,500,000 images (300,000 videos each with 15 frames of content). We then trained the dynamics predictor, freezing the weights of the perception module, using five different random seeds for weight initialization. For each seed of the dynamics predictor, we ran 5,000 probe quadruplets for each of the five concepts.
Training set size
As above, we pretrained a perceptual module on the full dataset size of 300,000 videos (4,500,000 images). We then varied the dataset size used to train the dynamics predictor using three random seeds while freezing the weights of the perception module. Training the dynamics predictor proceeded as above regardless of dataset size. We also included a ‘0’ training set size which corresponded to a model with an untrained dynamics predictor. This model still received object percepts from the pretrained perceptual module. The lack of VoE effects for the ‘0’ model shows that the VoE effects depend on more than just tracked object representations. Predicting dynamics is necessary for VoE effects on the five physical concepts probed.
Visual experience calculation
To compute the visual experience contained in our dataset, we calculated the following: 300,000 videos each lasting 2 s yielded 600,000 s or roughly 6.95 d of continuous visual experience. For only 50,000 videos, this amounts to roughly 28 h of continuous experience. Conservatively assuming 8 h of wakeful experience in a day yields an equivalent of roughly 20.9 d for 300,000 videos or 3.5 d for 50,000 videos.
Generalization test using ADEPT dataset
To measure our model’s ability to generalize, we tested PLATO on videos mined from the ADEPT dataset 29 . The dataset contains procedurally generated probes (scenarios in ADEPT) of intuitive physics, with objects entirely different from those of our own dataset. Each example in the dataset contains RGB images and corresponding object masks. The probes are organized into eight types. Similar to our probe quadruplets, ADEPT provides sets of possible and impossible videos for each probe type, although they do not use the same splicing procedure to carefully control for image-level differences between possible and impossible videos.
We had to make two changes to the ADEPT probes to apply PLATO. The first class of changes was cosmetic: we applied cropping and downsampling to the videos to make the videos match PLATO’s expected input size. Second, we had to confront the fact that PLATO expects input masks to be in a consistent ordering throughout a video. However, ADEPT only provides aligned input masks for one probe type: overturn short. Through a hard-coded, type-specific procedure, we were able to manually align two additional probe types: overturn long and block. In this way, we were able to ‘mine’ three probe types from the ADEPT dataset that span the concepts of object permanence, solidity and continuity. This served as a useful measure of PLATO’s generalization capabilities, but importantly cannot be compared directly to the ADEPT results.
First, we applied superficial changes to match the image size and video length expected by PLATO. We centrally cropped the ADEPT dataset from 320 × 480 pixels to 320 × 320 pixels. We manually verified on a subset of probes that this still displayed the relevant parts of the probe displays. Subsequently, we downsampled from 320 × 320 pixels to 64 × 64 pixels to match PLATO’s expected input size. Finally, PLATO expects to operate over a video of 15 frames, but ADEPT had much higher temporal resolution and videos which varied in length by probe type. Thus, for each of the different probe types, we changed the downsampling factor to always yield 15 frames: block was downsampled by a factor of 12, overturn short by a factor of 8 and overturn long by a factor of 15. Again, we manually verified on a subset of examples that the downsampled videos retained relevant events of the possible and impossible probes.
We evaluated PLATO on our modified versions of the overturn short, overturn long and block probes without any retraining of our model. Where our dataset contained probe quadruplets (two physically possible videos and two physically impossible videos), ADEPT contains pairs of videos: one physically possible and one physically impossible. This allowed us to compute essentially the same metrics used for our dataset: mean relative surprise and average accuracy. The only difference was that we ‘summed’ over a single video instead of two videos. We performed this evaluation for five seeds of our model trained with our full training dataset to produce Fig. 7 .
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Instructions on accessing the physical concepts dataset are available at https://github.com/deepmind/physical_concepts
Code availability
Our implementation of PLATO is not externally viable, but the corresponding author may be contacted for any clarifying questions on implementation details for building PLATO.
References
Mnih, V. et al. Playing atari with deep reinforcement learning. Preprint at https://arxiv.org/abs/1312.5602 (2013).
Silver, D. et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint at https://arxiv.org/abs/1712.01815 (2017).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Smith, B. C. The Promise of Artificial Intelligence: Reckoning and Judgment (MIT Press, 2019).
Michotte, A. The Perception of Causality (Basic Books, 1963).
Johnson, M. The Body in the Mind: the Bodily Basis of Meaning, Imagination, and Reason (Univ. Chicago Press, 2013).
Baillargeon, R. & Carey, S. in Early Childhood Development and Later Outcome (ed. Pauen, S.) 33–65 (Cambridge Univ. Press, 2012).
Baillargeon, R. Innate ideas revisited for a principle of persistence in infants’ physical reasoning. Perspect. Psychol. Sci. 3, 2–13 (2008).
We visualize ‘rollouts’ of PLATO’s predictions. Rollouts are generated by feeding PLATO-encoded input frames from the dataset. Halfway through the video, we start feeding PLATO’s outputs as inputs to simulate multi-step predictions.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article
Piloto, L.S., Weinstein, A., Battaglia, P. et al. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nat Hum Behav (2022). https://doi.org/10.1038/s41562-022-01394-8

Images Powered by Shutterstock