Research
Block and Detail: Scaffolding Sketch-to-Image Generation
[paper] [Code]
Vishnu Sarukkai, Lu Yuan*, Mia Tang*, Maneesh Agrawala, Kayvon Fatahalian. UIST 2024 (Oral)
We introduce a novel sketch-to-image tool that aligns with the iterative refinement process of artists. Our tool lets users sketch blocking strokes to coarsely represent the placement and form of objects and detail strokes to refine their shape and silhouettes. We develop a two-pass algorithm for generating high-fidelity images from such sketches at any point in the iterative process. In the first pass we use a ControlNet to generate an image that strictly follows all the strokes (blocking and detail) and in the second pass we add variation by renoising regions surrounding blocking strokes. We also present a dataset generation scheme that, when used to train a ControlNet architecture, allows regions that do not contain strokes to be interpreted as not-yet-specified regions rather than empty space. We show that this partial-sketch-aware ControlNet can generate coherent elements from partial sketches that only contain a small number of strokes. The high-fidelity images produced by our approach serve as scaffolds that can help the user adjust the shape and proportions of objects or add additional elements to the composition. We demonstrate the effectiveness of our approach with a variety of examples and evaluative comparisons.
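The two-pass structure can be illustrated with off-the-shelf Hugging Face diffusers components. This is a minimal sketch, not the authors' released code: the paper trains a custom partial-sketch-aware ControlNet, for which the generic scribble checkpoint below is only a stand-in, and the prompt, file names, and renoising strength are placeholder assumptions.

```python
# A minimal sketch of the two-pass idea using off-the-shelf diffusers components;
# the paper's partial-sketch-aware ControlNet is replaced here by a generic scribble
# checkpoint, and file names / prompt / strength are placeholder assumptions.
import torch
from PIL import Image
from diffusers import (ControlNetModel, StableDiffusionControlNetPipeline,
                       StableDiffusionInpaintPipeline)

prompt = "a cozy cabin in a snowy forest"                     # user's text prompt
sketch = Image.open("sketch.png").convert("RGB")              # all strokes (blocking + detail)
blocking_mask = Image.open("blocking_mask.png").convert("L")  # white around blocking strokes

# Pass 1: generate an image that strictly follows every stroke.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
pass1 = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
strict_image = pass1(prompt, image=sketch).images[0]

# Pass 2: add variation around blocking strokes by partially renoising those regions
# (approximated here with an inpainting pipeline run at reduced strength).
pass2 = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
scaffold = pass2(prompt, image=strict_image, mask_image=blocking_mask,
                 strength=0.7).images[0]
scaffold.save("scaffold.png")
```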
Learning to Place Objects into Scenes by Hallucinating Scenes around Objects
Lu Yuan, James Hong, Vishnu Sarukkai, and Kayvon Fatahalian. NeurIPS Workshop 2024.
The ability to modify images to add new objects into a scene stands to be a powerful image editing control, but is currently not robustly supported by existing diffusion-based image editing methods. We design a two-step method for inserting objects of a given class into images that first predicts where the object is likely to go in the image and, then, realistically inpaints the object at this location. The central challenge of our approach is predicting where an object should go in a scene, given only an image of the scene. We learn a prediction model entirely from synthetic data by using diffusion-based image outpainting to hallucinate novel images of scenes surrounding a given object. We demonstrate that this weakly supervised approach, which requires no human labels at all, is able to generate more realistic object addition image edits than prior text-controlled diffusion-based approaches. We also demonstrate that, for a limited set of object categories, our learned object placement prediction model, despite being trained entirely on generated data, makes more accurate object placements than prior state-of-the-art models for object placement that were trained on a large, manually annotated dataset.
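The inference-time procedure can be sketched in a few lines. The placement predictor below is a hypothetical placeholder for the paper's learned model (which is trained on scenes hallucinated around objects), and the checkpoint and file names are assumptions.

```python
# Illustrative two-step object insertion; predict_placement is a hypothetical stand-in
# for the learned placement model, and checkpoint / file names are assumptions.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

scene = Image.open("scene.jpg").convert("RGB").resize((512, 512))

# Step 1: predict where an object of the target class is likely to go.
def predict_placement(image, category):
    # Placeholder for the model trained on synthetic (object, hallucinated-scene) pairs.
    return (180, 260, 330, 430)  # (x0, y0, x1, y1)

box = predict_placement(scene, "dog")

# Step 2: realistically inpaint the object at the predicted location.
mask = Image.new("L", scene.size, 0)
ImageDraw.Draw(mask).rectangle(box, fill=255)  # white = region to fill with the object

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
edited = inpaint("a dog", image=scene, mask_image=mask).images[0]
edited.save("scene_with_dog.jpg")
```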
Learning Subject-Aware Cropping by Outpainting Professional Photos
James Hong, Lu Yuan, Michaël Gharbi, Matthew Fisher, and Kayvon Fatahalian. AAAI 2024.
How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is to combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity and its images serve as pseudo-labels for a good crop, while the text-to-image diffusion model is used to out-paint (i.e., in-paint outward) realistic, uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics.
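Generating one cropped/uncropped training pair can be sketched as follows; the checkpoint, padding amount, and prompt are illustrative assumptions rather than GenCrop's actual settings.

```python
# Illustrative generation of one cropped/uncropped training pair by outpainting a stock
# photo; checkpoint, padding, and prompt are assumptions, not GenCrop's actual settings.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

crop = Image.open("stock_photo.jpg").convert("RGB").resize((512, 512))  # pseudo-label crop

# Embed the professional crop in a larger canvas and mark the border as missing.
pad = 128
canvas = Image.new("RGB", (512 + 2 * pad, 512 + 2 * pad), (127, 127, 127))
canvas.paste(crop, (pad, pad))
mask = Image.new("L", canvas.size, 255)               # white = regions to outpaint
mask.paste(Image.new("L", crop.size, 0), (pad, pad))  # black = keep the original crop

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
uncropped = pipe("a professional photo of a person", image=canvas, mask_image=mask,
                 height=canvas.height, width=canvas.width).images[0]

# Training pair: the outpainted image plus the known location of the good crop inside it.
crop_box = (pad, pad, pad + 512, pad + 512)
```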
LC-NeRF: Local Controllable Face Generation in Neural Radiance Field
[paper]
Wenyang Zhou, Lu Yuan, Shuyu Chen, Lin Gao, and Shimin Hu. IEEE Transactions on Visualization and Computer Graphics (Accepted for Publication).
3D face generation has achieved high visual quality and 3D consistency thanks to the development of neural radiance fields (NeRF). Recently, several methods have been proposed to generate and edit 3D faces with a NeRF representation, achieving good results in decoupling geometry and texture. However, the latent codes of these generative models affect the whole face, so modifying them causes the entire face to change, whereas users typically want to edit a local region without affecting the rest of the face. Because changes to the latent code affect the global generation result, these methods do not allow fine-grained control of local facial regions. To improve local controllability in NeRF-based face editing, we propose LC-NeRF, which is composed of a Local Region Generators Module and a Spatial-Aware Fusion Module, enabling geometry and texture control over individual facial regions. Qualitative and quantitative evaluations show that our method provides better local editing than state-of-the-art face editing methods. It also performs well in downstream tasks such as text-driven facial image editing.
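A conceptual PyTorch sketch of the two modules is given below. The layer sizes, region set, and softmax blending are illustrative assumptions and not LC-NeRF's actual architecture; it only conveys the idea that each region has its own generator and latent code, fused by spatially predicted weights, so that editing one code changes one region.

```python
# Conceptual sketch of per-region radiance generators with spatially-aware fusion;
# layer sizes, the region set, and the softmax blending are illustrative assumptions.
import torch
import torch.nn as nn

class LocalRegionGenerator(nn.Module):
    """Maps a 3D point plus a region-specific latent code to a feature and a density."""
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden + 1))  # per-point feature and density

    def forward(self, xyz, z):
        z = z.expand(xyz.shape[0], -1)
        out = self.mlp(torch.cat([xyz, z], dim=-1))
        return out[:, :-1], out[:, -1:]

class SpatialAwareFusion(nn.Module):
    """Blends per-region outputs with weights predicted from the 3D position."""
    def __init__(self, num_regions, hidden=64):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, num_regions))

    def forward(self, xyz, feats, densities):
        w = torch.softmax(self.weight_net(xyz), dim=-1).unsqueeze(-1)  # (N, R, 1)
        feature = (w * torch.stack(feats, dim=1)).sum(dim=1)
        density = (w * torch.stack(densities, dim=1)).sum(dim=1)
        return feature, density  # fed to a color head / volume renderer

# Editing one region only requires changing that region's latent code.
regions = ["hair", "eyes", "mouth", "background"]  # illustrative region set
gens = nn.ModuleList([LocalRegionGenerator() for _ in regions])
fusion = SpatialAwareFusion(num_regions=len(regions))
xyz = torch.rand(1024, 3)                          # sampled points along camera rays
codes = [torch.randn(1, 64) for _ in regions]      # one latent code per region
feats, dens = zip(*(g(xyz, z) for g, z in zip(gens, codes)))
feature, sigma = fusion(xyz, list(feats), list(dens))
```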