Meta Reality Labs Research
SceneScript is a method for representing and inferring scene geometry using an autoregressive structured language model and end-to-end learning.
What is it for?
SceneScript allows AR & AI devices to understand the geometry of physical spaces
Shown on footage captured by Aria glasses, SceneScript can take visual input and estimate scene elements, such as walls, doors, or windows.
Shown on Meta Quest, scene elements predicted by SceneScript can be arbitrarily extended to include new architectural features, objects, and even object decompositions.
SceneScript jointly estimates room layouts and objects from visual data using end-to-end learning
End-to-end learning avoids the need for fragile ‘hard-coded’ rules.
SceneScript is provided visual information in the form of images or a point cloud from an egocentric device.
SceneScript encodes the visual information into a latent representation, which describes the physical space.
SceneScript decodes the latent representation to a concise, parametric, and interpretable language, similar to CAD.
A 3D interpreter can convert the language to a geometric representation of the physical space.
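As a rough illustration of that last step, the sketch below shows how a small interpreter could turn SceneScript-style commands into geometric primitives. The command name, parameters, and syntax used here (a make_wall entry with two corner points and a height) are simplified assumptions for illustration, not the released language specification.

```python
from dataclasses import dataclass

@dataclass
class Wall:
    a: tuple        # (x, y, z) of the first corner
    b: tuple        # (x, y, z) of the second corner
    height: float

def interpret(script: str):
    """Convert a SceneScript-like program into a list of geometric entities."""
    entities = []
    for line in script.strip().splitlines():
        cmd, _, args = line.partition(",")
        params = dict(p.strip().split("=") for p in args.split(",") if "=" in p)
        if cmd.strip() == "make_wall":
            entities.append(Wall(
                a=(float(params["a_x"]), float(params["a_y"]), float(params["a_z"])),
                b=(float(params["b_x"]), float(params["b_y"]), float(params["b_z"])),
                height=float(params["height"]),
            ))
        # Handlers for make_door, make_window, etc. would follow the same pattern.
    return entities

example = "make_wall, a_x=0, a_y=0, a_z=0, b_x=4.2, b_y=0, b_z=0, height=2.6"
print(interpret(example))
```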
Flexible and extensible model design
Trained in Simulation
SceneScript is trained using Aria Synthetic Environments, a fully simulated dataset consisting of 100,000 unique interior environments, each simulated with the same camera characteristics as Project Aria.
The Aria Synthetic Environments dataset was made available to academic researchers last year, along with a public research challenge, to accelerate open research in this area.
Adaptable to new scene representations
Because SceneScript both represents and predicts scenes using pure language, the set of scene elements can be easily extended by expanding the language used to describe the simulated data.
Unlike traditional rule-based approaches for scene reconstruction, this means there is no need to train new 3D detectors for different objects or scene elements.
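A hedged sketch of what this extensibility could look like in code: supporting a new scene element amounts to adding a command to the language used for the simulated data and registering a matching handler in the interpreter, rather than designing a new 3D detector. The make_bbox command and its parameters below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BBox:
    category: str
    center: tuple   # (x, y, z)
    size: tuple     # (width, depth, height)

def make_bbox(params):
    return BBox(
        category=params["class"],
        center=(float(params["x"]), float(params["y"]), float(params["z"])),
        size=(float(params["w"]), float(params["d"]), float(params["h"])),
    )

# Adding an element type is just another entry in the command registry;
# existing entries ("make_wall", "make_door", ...) are untouched and no
# separate 3D detector has to be designed or trained.
COMMANDS = {"make_bbox": make_bbox}

params = {"class": "chair", "x": "1.0", "y": "0.5", "z": "0.0",
          "w": "0.6", "d": "0.6", "h": "0.9"}
print(COMMANDS["make_bbox"](params))
```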
Enables LLMs to reason about physical spaces
SceneScript leverages the same method of next-token prediction as large language models. This provides AI models with the vocabulary needed to reason about physical spaces.
This advancement could ultimately unlock next-generation digital assistants by providing the real-world context necessary to answer complex spatial queries.
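To make the connection to next-token prediction concrete, the toy example below shows how a structured scene command could be tokenized into a discrete sequence that an autoregressive model predicts one token at a time. The vocabulary and discretization are illustrative assumptions, not SceneScript's actual token scheme.

```python
VOCAB = {"<start>": 0, "<stop>": 1, "make_wall": 2, "make_door": 3, "make_window": 4}
NUM_BINS = 256               # continuous parameters are discretized into integer bins
PARAM_OFFSET = len(VOCAB)    # parameter tokens live after the command tokens

def encode_param(value, lo=0.0, hi=10.0):
    """Map a continuous value (e.g. metres) to a discrete token id."""
    bin_id = int((value - lo) / (hi - lo) * (NUM_BINS - 1))
    return PARAM_OFFSET + max(0, min(NUM_BINS - 1, bin_id))

# A simplified wall command becomes a flat token sequence that an autoregressive
# model can learn to predict one token at a time.
tokens = [VOCAB["<start>"], VOCAB["make_wall"],
          encode_param(0.0), encode_param(0.0),   # first corner (x, y)
          encode_param(4.2), encode_param(0.0),   # second corner (x, y)
          encode_param(2.6),                      # height
          VOCAB["<stop>"]]
print(tokens)
```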
SceneScript model weights are now available for research
If you are a researcher in AI or ML, you can access the SceneScript model weights here.
By submitting your email and accessing the SceneScript model, you agree to abide by the dataset license agreement and to receive emails in relation to SceneScript and the Aria Simulated Environments dataset.
Learn more about SceneScript
For more information about the SceneScript research, read our paper on arXiv and watch the supplementary video.
Frequently Asked Questions
Because SceneScript is trained in simulation, no real-world data was used for training the model. To ensure that SceneScript works as expected for real-world scenes, the model was validated in fully consented environments.
The base point cloud encoder and decoder each comprise approximately 35M parameters, for a total of around 70M parameters.
The model is trained to convergence over about 200k iterations, which takes a total of ~3 days.
At the time of release, SceneScript is not used on Quest; it is a research project from Reality Labs Research.
The model has been trained exclusively on synthetic indoor scenes, so inference on outdoor scenes may result in unpredictable outputs.
The model is coarsely segmented into an encoder and decoder. The encoder consists of a series of 3D sparse convolution blocks pooling a large point cloud to a small number of features. Subsequently, a transformer decoder autoregressively generates tokens by leveraging the encoder's features as context for cross-attention.
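The following PyTorch sketch mirrors that coarse encoder/decoder split. For readability the sparse-convolution encoder is replaced here with a point-wise MLP and pooling stand-in, and all sizes (feature dimensions, vocabulary, layer counts) are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Stand-in for the sparse-conv encoder: pools N points down to a few feature vectors."""
    def __init__(self, d_model=256, n_latents=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model))
        self.n_latents = n_latents

    def forward(self, points):              # points: (B, N, 3), assumes N >= n_latents
        feats = self.mlp(points)            # (B, N, d_model)
        B, N, D = feats.shape
        # Coarse pooling into a small, fixed number of context features.
        feats = feats[:, : (N // self.n_latents) * self.n_latents]
        feats = feats.view(B, self.n_latents, -1, D).max(dim=2).values
        return feats                        # (B, n_latents, d_model)

class SceneScriptLikeDecoder(nn.Module):
    """Transformer decoder that cross-attends to encoder features and predicts tokens."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):      # tokens: (B, T), memory: (B, n_latents, d_model)
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.decoder(x, memory, tgt_mask=mask)
        return self.head(x)                 # (B, T, vocab_size) next-token logits
```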
Using a vanilla, non-optimized transformer implemented directly in PyTorch, decoding 256 tokens (equivalent to a medium-sized scene containing walls, doors, windows, and object bounding boxes) requires approximately 2-3 seconds.
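For a rough sense of that latency, the self-contained snippet below times a greedy decode of 256 tokens with a plain nn.TransformerDecoder. The model sizes are arbitrary and actual timings depend heavily on hardware, so this only reproduces the shape of the measurement, not the reported number.

```python
import time
import torch
import torch.nn as nn

d_model, vocab_size, n_ctx = 256, 1024, 64
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
head = nn.Linear(d_model, vocab_size)
memory = torch.randn(1, n_ctx, d_model)          # encoder features used as cross-attention context

tokens = torch.zeros(1, 1, dtype=torch.long)     # start token
start = time.time()
with torch.no_grad():
    for _ in range(256):
        x = embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        logits = head(decoder(x, memory, tgt_mask=mask))
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
print(f"decoded {tokens.size(1) - 1} tokens in {time.time() - start:.1f}s")
```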
No, not at the moment, as the current model is trained using sequences that simulate what would be captured on Project Aria glasses. However, the model could be fine-tuned using a different camera model with a different kind of lens.
Yes, the model weights for SceneScript were made available for academic researchers in September 2024.
Acknowledgements
Research authors
Armen Avetisyan, Chris Xie, Henry Howard-Jenkins, Tsun-Yi Yang,
Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme,
Edward Miller, Jakob Engel, Richard Newcombe, Vasileios Balntas.