
glide-text2im's Introduction

GLIDE

This is the official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

For details on the pre-trained models in this repository, see the Model Card.

Usage

To install this package, clone this repository and then run:

pip install -e .

For detailed usage examples, see the notebooks directory.

  • The text2im notebook shows how to use GLIDE (filtered) with classifier-free guidance to produce images conditioned on text prompts.
  • The inpaint notebook shows how to use GLIDE (filtered) to fill in a masked region of an image, conditioned on a text prompt.
  • The clip_guided notebook shows how to use GLIDE (filtered) + a filtered noise-aware CLIP model to produce images conditioned on text prompts.
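
For orientation, here is a condensed sketch (adapted from the text2im notebook quoted further down this page) of loading the filtered base model; see the notebooks for the full sampling loop and the upsampler.

import torch as th
from glide_text2im.download import load_checkpoint
from glide_text2im.model_creation import create_model_and_diffusion, model_and_diffusion_defaults

device = th.device('cuda' if th.cuda.is_available() else 'cpu')

# Build the base 64x64 text-to-image model with the default hyperparameters.
options = model_and_diffusion_defaults()
options['use_fp16'] = (device.type == 'cuda')
options['timestep_respacing'] = '100'  # 100 diffusion steps for fast sampling
model, diffusion = create_model_and_diffusion(**options)
model.eval()
if device.type == 'cuda':
    model.convert_to_fp16()
model.to(device)
model.load_state_dict(load_checkpoint('base', device))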

glide-text2im's People

Contributors

kcosta42, lochiego, prafullasd, unixpickle


glide-text2im's Issues

Better resolution images for inpainting

Hello, thank you for this model!

I have been wondering how to get better output resolution for inpainting. I believe the main issue is the downsizing of the input image to 64x64, which loses a lot of detail. The upsampling can then only go up to 256x256 (more creates artifacts).

I tried replacing 64x64 with something like 128x128 (which would then make it easier to upsample to 512x512), but got the error below.

Is there a way to improve the output resolution of the inpainting model? In particular, I would like to test my hypothesis that the low resolution is caused by the downsample, fill in, then upsample pipeline.

This is the error I get when replacing 64x64 -> 128x128 in the Colab:

[screenshot of the error]

Thanks!
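
One quick way to test the hypothesis that the 64x64 bottleneck is the culprit, without touching the model at all, is to round-trip an image through that resolution with PIL and compare it to the original. A minimal sketch (the file name is a placeholder):

from PIL import Image

# Round-trip an image through the 64x64 bottleneck to see how much
# detail the base inpainting resolution discards.
img = Image.open("my_photo.png").convert("RGB")    # placeholder path
small = img.resize((64, 64), Image.BICUBIC)        # what the base model sees
back = small.resize((256, 256), Image.BICUBIC)     # naive re-upsample for comparison
back.save("roundtrip_64.png")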

The results of generating people are poor

Hi, I tried to use your great work to do some testing, but I found the results are poor when the prompt is about a woman or a man.

For example, when the prompt is "A woman with long hair and glasses", the result is:

[screenshot of the generated result]

I think this model is not well suited to generating people, isn't it?

In GPU mode generated image is all black with NaN tensor values (no problems in CPU mode)

Hello,
For both "text2im.ipynb" and "clip_guided.ipynb" I'm seeing that the generated image is all black.
This only happens in GPU mode (Nvidia GTX 1660 TI, 6 GB), while in CPU mode the image is generated correctly.
I'm on Windows 10 using Python 3.8 and

torch-1.11.0+cu115 pypi_0 pypi
torchvision-0.12.0+cu115 pypi_0 pypi

and this environment works fine for all other ML projects I'm running.

In "text2im.ipynb" I saw that tensor values become NaN in the model_fn function, when model() is called:

# Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)    
#-----
    # Values of 'combined' are not NaN
    model_out = model(combined, ts, **kwargs)
    # Values of 'model_out' are NaN
#-----        
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)

As I tried to track down the problem a bit further, I found that the values start going wrong in the forward function of "text2im_model.py":

def forward(self, x, timesteps, tokens=None, mask=None):
    hs = []
    # Timestep embedding (plus the text transformer's projection, if enabled).
    emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
    if self.xf_width:
        text_outputs = self.get_text_emb(tokens, mask)
        xf_proj, xf_out = text_outputs["xf_proj"], text_outputs["xf_out"]
        emb = emb + xf_proj.to(emb)
    else:
        xf_out = None
    h = x.type(self.dtype)
    # UNet encoder: this is the loop where the NaNs first appear.
    for module in self.input_blocks:
        h = module(h, emb, xf_out)
        hs.append(h)
    h = self.middle_block(h, emb, xf_out)
    # UNet decoder with skip connections.
    for module in self.output_blocks:
        h = th.cat([h, hs.pop()], dim=1)
        h = module(h, emb, xf_out)
    h = h.type(x.dtype)
    h = self.out(h)
    return h

specifically at line 133, where module is called:

for module in self.input_blocks:
    h = module(h, emb, xf_out)
    hs.append(h)

Here, at iteration # 2 some values become NaN and at iteration # 6 all values become NaN.

Please take a look:

----------------- INSIDE FOR LOOP, iteration #:  1 
----------------- INSIDE FOR LOOP, value of 'h' before module call: 

 tensor([[[[ 0.9609,  0.4629, -0.9834,  ...,  1.6162, -0.5767, -0.4253],
          [ 0.5947, -0.8301,  1.7686,  ..., -2.5215,  0.2920, -0.2183],
          [ 1.9561, -0.8403,  0.4053,  ...,  0.4990, -2.0176, -0.2935],
          ...,
          [ 1.8125, -0.4285,  0.1121,  ..., -1.1416, -2.6562, -1.1348],
          [ 0.9204, -0.4434, -0.1824,  ...,  0.2864,  1.7188, -0.8999],
          [ 1.8369,  0.2583,  0.4895,  ...,  1.4004,  1.5371,  2.8203]],

         [[ 1.7607,  0.4749,  1.9160,  ..., -0.6079, -0.5513, -3.0527],
          [ 0.9780,  1.3984,  1.7266,  ...,  0.2903, -0.7969, -1.4316],
          [-0.5293, -2.6465, -1.6699,  ..., -0.2900, -1.6738,  0.6704],
          ...,
          [ 0.0657, -0.7827,  1.1904,  ..., -0.3643,  0.7754, -0.8740],
          [ 1.0801, -1.1260, -0.1700,  ...,  1.4443, -0.3196, -0.1392],
          [-1.0645,  1.0898, -0.3838,  ...,  0.3491,  0.4077, -1.4492]],

         [[ 0.1176,  0.6514,  0.8452,  ...,  1.3486, -2.3496, -0.1377],
          [-1.6523, -0.1711, -0.1355,  ...,  1.2236,  1.0068,  1.9863],
          [ 0.7456,  1.1943,  0.1819,  ..., -2.1719,  1.7148,  0.0917],
          ...,
          [ 0.4253, -1.0078,  0.7847,  ...,  1.1348,  0.8101,  0.7744],
          [-1.1299, -0.0173, -0.5522,  ...,  0.3960,  1.0762,  0.1404],
          [-0.0644, -0.0656,  1.1670,  ..., -0.1234,  0.6870, -0.5278]]],
...
device='cuda:0', dtype=torch.float16)

----------------- INSIDE FOR LOOP, module function that will now be called is: 

 TimestepEmbedSequential(
  (0): Conv2d(3, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
----------------- INSIDE FOR LOOP, value of 'h' after module call: 

 tensor([[[[-0.3325, -0.4204, -1.3887,  ...,  0.0850, -0.1570, -0.6255],
          [ 0.5010, -0.4548,  0.2632,  ..., -1.8027, -0.2144, -1.4512],
          [ 0.1343, -1.0498,  0.4097,  ..., -0.0427, -2.1836, -0.3203],
          ...,
          [-0.2983, -0.2622, -1.0098,  ..., -1.7773, -1.7871, -1.3760],
          [ 0.1865, -0.8691, -0.1841,  ..., -0.5342, -0.8232, -1.7949],
          [ 0.4858, -0.7051, -0.7515,  ...,  0.7300,  0.0771,  0.6509]],

         [[-0.5107, -0.1924,  0.4790,  ..., -1.6797,  1.5586, -1.1074],
          [-0.8438, -1.3945, -0.8652,  ..., -0.1021, -1.9297, -1.8242],
          [-1.6289,  0.6030, -1.5410,  ...,  1.0488, -0.4473,  0.7524],
          ...,
          [-2.0586,  0.6978, -1.9316,  ..., -1.4785,  1.0742,  0.2190],
          [-1.0010, -0.6309,  0.3979,  ...,  0.3286, -0.3005,  0.8218],
          [-1.4961, -1.0723, -1.5293,  ...,  1.8125, -0.7954, -0.2915]],
...
device='cuda:0', dtype=torch.float16)

----------------- INSIDE FOR LOOP, iteration #:  2 
----------------- INSIDE FOR LOOP, value of 'h' before module call: 

 tensor([[[[-0.3325, -0.4204, -1.3887,  ...,  0.0850, -0.1570, -0.6255],
          [ 0.5010, -0.4548,  0.2632,  ..., -1.8027, -0.2144, -1.4512],
          [ 0.1343, -1.0498,  0.4097,  ..., -0.0427, -2.1836, -0.3203],
          ...,
          [-0.2983, -0.2622, -1.0098,  ..., -1.7773, -1.7871, -1.3760],
          [ 0.1865, -0.8691, -0.1841,  ..., -0.5342, -0.8232, -1.7949],
          [ 0.4858, -0.7051, -0.7515,  ...,  0.7300,  0.0771,  0.6509]],

         [[-0.5107, -0.1924,  0.4790,  ..., -1.6797,  1.5586, -1.1074],
          [-0.8438, -1.3945, -0.8652,  ..., -0.1021, -1.9297, -1.8242],
          [-1.6289,  0.6030, -1.5410,  ...,  1.0488, -0.4473,  0.7524],
          ...,
          [-2.0586,  0.6978, -1.9316,  ..., -1.4785,  1.0742,  0.2190],
          [-1.0010, -0.6309,  0.3979,  ...,  0.3286, -0.3005,  0.8218],
          [-1.4961, -1.0723, -1.5293,  ...,  1.8125, -0.7954, -0.2915]],
...
device='cuda:0', dtype=torch.float16)

----------------- INSIDE FOR LOOP, module function that will now be called is: 
 TimestepEmbedSequential(
  (0): ResBlock(
    (in_layers): Sequential(
      (0): GroupNorm32(32, 192, eps=1e-05, affine=True)
      (1): Identity()
      (2): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (h_upd): Identity()
    (x_upd): Identity()
    (emb_layers): Sequential(
      (0): SiLU()
      (1): Linear(in_features=768, out_features=384, bias=True)
    )
    (out_layers): Sequential(
      (0): GroupNorm32(32, 192, eps=1e-05, affine=True)
      (1): SiLU()
      (2): Dropout(p=0.1, inplace=False)
      (3): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (skip_connection): Identity()
  )
)

----------------- INSIDE FOR LOOP, value of 'h' after module call: 

 tensor([[[[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          ...,
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

         [[-0.6113, -0.2927,  0.3787,  ..., -1.7803,  1.4580, -1.2080],
          [-0.9443, -1.4951, -0.9658,  ..., -0.2024, -2.0293, -1.9248],
          [-1.7295,  0.5024, -1.6416,  ...,  0.9482, -0.5479,  0.6519],
          ...,
          [-2.1582,  0.5972, -2.0312,  ..., -1.5791,  0.9736,  0.1186],
          [-1.1016, -0.7314,  0.2976,  ...,  0.2283, -0.4009,  0.7212],
          [-1.5967, -1.1729, -1.6299,  ...,  1.7119, -0.8960, -0.3918]],
...
device='cuda:0', dtype=torch.float16)

As you can see, at this point only some values have become NaN.
It remains this way until iteration #6, where, after the module call, ALL values become NaN:

----------------- INSIDE FOR LOOP, iteration #:  6 
----------------- INSIDE FOR LOOP, value of 'h' before module call: 

 tensor([[[[        nan,         nan,         nan,  ...,         nan,
                   nan,         nan],
          [        nan,         nan,         nan,  ...,         nan,
                   nan,         nan],
          [        nan,         nan,         nan,  ...,         nan,
                   nan,         nan],
          ...,
          [        nan,         nan,         nan,  ...,         nan,
                   nan,         nan],
          [        nan,         nan,         nan,  ...,         nan,
                   nan,         nan],
          [        nan,         nan,         nan,  ...,         nan,
                   nan,         nan]],

         [[-9.6973e-01,  3.6084e-01, -8.0078e-01,  ..., -6.1328e-01,
           -1.1406e+00, -1.0596e+00],
          [-4.0210e-01, -1.0947e+00, -2.0898e-01,  ..., -7.3730e-01,
           -6.4258e-01, -3.1860e-01],
          [-4.3530e-01, -4.1577e-01, -4.6655e-01,  ...,  5.1880e-02,
            1.5601e-01, -4.0283e-02],
          ...,
          [-5.6934e-01,  2.7954e-01, -1.4346e+00,  ..., -4.4751e-01,
           -1.3428e-02, -2.9565e-01],
          [-5.2148e-01, -6.8652e-01, -8.8770e-01,  ..., -2.4341e-01,
           -1.3213e+00,  2.9517e-01],
          [-1.2842e+00, -6.5234e-01, -1.9214e-01,  ..., -1.8779e+00,
           -3.9526e-01, -3.7500e-01]],
...
device='cuda:0', dtype=torch.float16)

----------------- INSIDE FOR LOOP, module function that will now be called is: 

 TimestepEmbedSequential(
  (0): ResBlock(
    (in_layers): Sequential(
      (0): GroupNorm32(32, 192, eps=1e-05, affine=True)
      (1): Identity()
      (2): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (h_upd): Identity()
    (x_upd): Identity()
    (emb_layers): Sequential(
      (0): SiLU()
      (1): Linear(in_features=768, out_features=768, bias=True)
    )
    (out_layers): Sequential(
      (0): GroupNorm32(32, 384, eps=1e-05, affine=True)
      (1): SiLU()
      (2): Dropout(p=0.1, inplace=False)
      (3): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (skip_connection): Conv2d(192, 384, kernel_size=(1, 1), stride=(1, 1))
  )
  (1): AttentionBlock(
    (norm): GroupNorm32(32, 384, eps=1e-05, affine=True)
    (qkv): Conv1d(384, 1152, kernel_size=(1,), stride=(1,))
    (attention): QKVAttention()
    (encoder_kv): Conv1d(512, 768, kernel_size=(1,), stride=(1,))
    (proj_out): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
  )
)
----------------- INSIDE FOR LOOP, value of 'h' after module call: 

 tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],
...
device='cuda:0', dtype=torch.float16)

With my limited knowledge of this field, this is all I could find.
Please let me know if there is any other info I can provide.

About CLIP training on noised images

Hey! I think GLIDE is wonderful work, but I have a question about training CLIP on noised images.

I want to know why CLIP can be trained on noised images at all. If t (ranging from 0 to 1000) is large (say, close to 500 or more), the noised images hardly contain any semantic information. In that case, how can the CLIP model encode similar features from the noised images and the text? I also suspect this could keep the model from converging, because it is hard to match features between heavily noised images and text.

Something wrong with upsample-inpaint checkpoint

When the program runs model_up.load_state_dict(load_checkpoint('upsample-inpaint', device)), it reports an error:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
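
This error usually means the cached checkpoint file is truncated or corrupted (for example, an interrupted download). A sketch of the usual fix is to delete the cached file and let load_checkpoint fetch it again; the cache directory and file name below are assumptions based on the defaults in download.py and may differ in your install.

import os
import torch as th
from glide_text2im.download import load_checkpoint

# Assumption: this is the default cache directory used by glide_text2im's
# download.py; check download.py in your install if the path differs.
cache_dir = os.path.join(os.getcwd(), "glide_model_cache")
ckpt_path = os.path.join(cache_dir, "upsample_inpaint.pt")  # assumed file name

if os.path.exists(ckpt_path):
    os.remove(ckpt_path)  # delete the (presumably truncated) download

device = th.device('cuda' if th.cuda.is_available() else 'cpu')
state_dict = load_checkpoint('upsample-inpaint', device)  # fetches a fresh copy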

Training code?

Thanks for a truly awesome feat of machine learning. Do you have any plans to release the training code?

some errors

How do I get the result picture? I run the sample code, but no image appears.
My code is:

from PIL import Image
from IPython.display import display
import torch as th

from glide_text2im.download import load_checkpoint
from glide_text2im.model_creation import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
    model_and_diffusion_defaults_upsampler
)

### This notebook supports both CPU and GPU.
### On CPU, generating one sample may take on the order of 20 minutes.
### On a GPU, it should be under a minute.

has_cuda = th.cuda.is_available()
device = th.device('cpu' if not has_cuda else 'cuda')

### Create base model.
options = model_and_diffusion_defaults()
options['use_fp16'] = has_cuda
options['timestep_respacing'] = '100' # use 100 diffusion steps for fast sampling
model, diffusion = create_model_and_diffusion(**options)
model.eval()
if has_cuda:
    model.convert_to_fp16()
model.to(device)
model.load_state_dict(load_checkpoint('base', device))
print('total base parameters', sum(x.numel() for x in model.parameters()))

### Create upsampler model.
options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))

def show_images(batch: th.Tensor):
    """ Display a batch of images inline. """
    scaled = ((batch + 1)*127.5).round().clamp(0,255).to(th.uint8).cpu()
    reshaped = scaled.permute(2, 0, 3, 1).reshape([batch.shape[2], -1, 3])
    display(Image.fromarray(reshaped.numpy()))

### Sampling parameters
prompt = "an oil painting of a corgi"
batch_size = 1
guidance_scale = 3.0

### Tune this parameter to control the sharpness of 256x256 images.
### A value of 1.0 is sharper, but sometimes results in grainy artifacts.
upsample_temp = 0.997

##############################
### Sample from the base model ###
##############################

### Create the text tokens to feed to the model.
tokens = model.tokenizer.encode(prompt)
tokens, mask = model.tokenizer.padded_tokens_and_mask(
    tokens, options['text_ctx']
)

### Create the classifier-free guidance tokens (empty)
full_batch_size = batch_size * 2
uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
    [], options['text_ctx']
)

### Pack the tokens together into model kwargs.
model_kwargs = dict(
    tokens=th.tensor(
        [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size + [uncond_mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
)

### Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)

### Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model.del_cache()

### Show the output
show_images(samples)

##############################
### Upsample the 64x64 samples ###
##############################

tokens = model_up.tokenizer.encode(prompt)
tokens, mask = model_up.tokenizer.padded_tokens_and_mask(
    tokens, options_up['text_ctx']
)

### Create the model conditioning dict.
model_kwargs = dict(
    ### Low-res image to upsample.
    low_res=((samples+1)*127.5).round()/127.5 - 1,

    ### Text tokens
    tokens=th.tensor(
        [tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
)

### Sample from the base model.
model_up.del_cache()
up_shape = (batch_size, 3, options_up["image_size"], options_up["image_size"])
up_samples = diffusion_up.ddim_sample_loop(
    model_up,
    up_shape,
    noise=th.randn(up_shape, device=device) * upsample_temp,
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model_up.del_cache()

### Show the output
show_images(up_samples)
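
If you run this as a plain Python script rather than in Jupyter/Colab, IPython's display() has nowhere to render, so nothing appears. A sketch of saving the samples to a PNG instead, mirroring the layout that show_images produces:

from PIL import Image
import torch as th

def save_images(batch: th.Tensor, path: str):
    """Save a batch of [-1, 1] images as one horizontal strip (same layout as show_images)."""
    scaled = ((batch + 1) * 127.5).round().clamp(0, 255).to(th.uint8).cpu()
    reshaped = scaled.permute(2, 0, 3, 1).reshape([batch.shape[2], -1, 3])
    Image.fromarray(reshaped.numpy()).save(path)

# e.g. after sampling:
# save_images(samples, "base_64.png")
# save_images(up_samples, "upsampled_256.png")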

Higher Resolution

Is there a way to upsize the outputs to something closer to 1024px? I've noticed a few people on Twitter who have been able to do so with this model, but after trying to change the image size to a higher value I get this error for anything over 256:

/usr/local/lib/python3.7/dist-packages/glide_text2im/model_creation.py in create_model(image_size, num_channels, num_res_blocks, channel_mult, attention_resolutions, num_heads, num_head_channels, num_heads_upsample, use_scale_shift_norm, dropout, text_ctx, xf_width, xf_layers, xf_heads, xf_final_ln, xf_padding, resblock_updown, use_fp16, cache_text_emb, inpaint, super_res)
140 channel_mult = (1, 2, 3, 4)
141 else:
--> 142 raise ValueError(f"unsupported image size: {image_size}")
143 else:
144 channel_mult = tuple(int(ch_mult) for ch_mult in channel_mult.split(","))
ValueError: unsupported image size: 1024
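
The traceback above is create_model refusing sizes it has no channel_mult preset for; the released checkpoints are tied to 64x64 (base) and 256x256 (upsampler), so 1024 cannot be requested directly. As a stopgap you can upscale the 256x256 output with an ordinary resampler, which adds pixels but not detail. A minimal sketch (file names are placeholders):

from PIL import Image

# Naive post-hoc upscale of a saved 256x256 sample to 1024x1024.
# This only interpolates pixels; it will not add the detail a real
# 1024px diffusion upsampler would.
img = Image.open("upsampled_256.png")              # placeholder path
img.resize((1024, 1024), Image.LANCZOS).save("upscaled_1024.png")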

Missing `regex`

The project requires regex and it is missing from setup.py.
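
Until setup.py is patched, pip install regex in the same environment works around it. A sketch of the fix in setup.py itself (the surrounding arguments are abbreviated/illustrative, not the repo's exact file):

from setuptools import setup

setup(
    name="glide-text2im",
    packages=["glide_text2im"],
    install_requires=[
        # ... existing dependencies ...
        "regex",  # add the missing dependency so `pip install -e .` pulls it in
    ],
)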

Question about generating masks

Hi, thanks for your great work. I have a question related to mask generation in "bpe.py".

[screenshot of the relevant code in bpe.py]

As shown in the above figure, it seems that len(tokens) = text_ctx, and then padding = 0. Does this mean there is no padding mask?

Best wishes,
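
For reference, the logic in question boils down to something like the following paraphrase (not the exact bpe.py source): the prompt is truncated to text_ctx tokens and the mask is True only for real tokens, so padding = 0 simply means the prompt already fills the context and there is nothing left to mask out.

from typing import List, Tuple

def padded_tokens_and_mask_sketch(tokens: List[int], text_ctx: int) -> Tuple[List[int], List[bool]]:
    """Paraphrase of the padding logic discussed above (not the exact bpe.py code)."""
    tokens = tokens[:text_ctx]                       # truncate to the context length
    padding = text_ctx - len(tokens)                 # 0 when the prompt fills the context
    padded_tokens = tokens + [0] * padding           # pad with a filler token id
    mask = [True] * len(tokens) + [False] * padding  # False marks padding positions
    return padded_tokens, mask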

Training Code

Really Great Work!
I wonder if you'll make training scripts available. If yes, are there any plans for when you'll make them available?
Thanks

While running the clip_guided notebook in CPU mode I get: "RuntimeError - Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead"

When I run clip_guided notebook in CPU mode, I get the following error at the "Sample from the base model" cell:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9272/4093479580.py in <module>
     20 # Sample from the base model.
     21 model.del_cache()
---> 22 samples = diffusion.p_sample_loop(
     23     model,
     24     (batch_size, 3, options["image_size"], options["image_size"]),

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
    387         """
    388         final = None
--> 389         for sample in self.p_sample_loop_progressive(
    390             model,
    391             shape,

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop_progressive(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
    439             t = th.tensor([i] * shape[0], device=device)
    440             with th.no_grad():
--> 441                 out = self.p_sample(
    442                     model,
    443                     img,

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample(self, model, x, t, clip_denoised, denoised_fn, cond_fn, model_kwargs)
    351         )  # no noise when t == 0
    352         if cond_fn is not None:
--> 353             out["mean"] = self.condition_mean(cond_fn, out, x, t, model_kwargs=model_kwargs)
    354         sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
    355         return {"sample": sample, "pred_xstart": out["pred_xstart"]}

c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in condition_mean(self, cond_fn, *args, **kwargs)
     95 
     96     def condition_mean(self, cond_fn, *args, **kwargs):
---> 97         return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
     98 
     99     def condition_score(self, cond_fn, *args, **kwargs):

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs)
    287         This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
    288         """
--> 289         gradient = cond_fn(x, t, **model_kwargs)
    290         new_mean = p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
    291         return new_mean

c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in __call__(self, x, ts, **kwargs)
    122         new_ts_2 = map_tensor[ts.ceil().long()]
    123         new_ts = th.lerp(new_ts_1, new_ts_2, frac)
--> 124         return self.model(x, new_ts, **kwargs)

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in cond_fn(x, t, grad_scale, **kwargs)
     57             with torch.enable_grad():
     58                 x_var = x.detach().requires_grad_(True)
---> 59                 z_i = self.image_embeddings(x_var, t)
     60                 loss = torch.exp(self.logit_scale) * (z_t * z_i).sum()
     61                 grad = torch.autograd.grad(loss, x_var)[0].detach()

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in image_embeddings(self, images, t)
     47 
     48     def image_embeddings(self, images: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
---> 49         z_i = self.image_encoder((images + 1) * 127.5, t)
     50         return z_i / (torch.linalg.norm(z_i, dim=-1, keepdim=True) + 1e-12)
     51 

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, image, timesteps, return_probe_features)
    483     ) -> torch.Tensor:
    484         n_batch = image.shape[0]
--> 485         h = self.blocks["input"](image, t=timesteps)
    486 
    487         for i in range(self.n_xf_blocks):

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, x, t)
    124             self.pred_state[None, None].expand(x.shape[0], -1, -1)
    125             if self.n_timestep == 0
--> 126             else F.embedding(cast(torch.Tensor, t), self.w_t)[:, None]
    127         )
    128         x = torch.cat((sot, x), dim=1) + self.w_pos[None]

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1850         # remove once script supports set_grad_enabled
   1851         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1852     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1853 
   1854 

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead (while checking arguments for embedding)

Can anyone help?
Thanks!
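
The traceback shows the CLIP encoder's timestep embedding receiving float timesteps: respace.py interpolates them with th.lerp before calling the conditioning function, and F.embedding requires Long indices. One workaround sketch (not a root-cause fix) is to wrap whatever cond_fn the notebook builds from the CLIP model so the timesteps are truncated back to integer indices before the embedding lookup:

# `cond_fn` below is whatever conditioning function the notebook already
# builds from the CLIP model; we only wrap it so timesteps arrive as Long.
def make_long_ts(cond_fn):
    def wrapped(x, t, **kwargs):
        # respace.py hands us interpolated float timesteps; truncate them
        # back to integer indices for the timestep embedding.
        return cond_fn(x, t.long(), **kwargs)
    return wrapped

# cond_fn = make_long_ts(cond_fn)   # then pass this to diffusion.p_sample_loop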

Thanks!

Thank you for releasing the filtered model. I realize a lot of people have complained about this, but it's a dicey alignment issue unfortunately. That you did the extra work to release something was very cool.

Feel free to close 👍

Any plans for a blog post with more samples?

Hi, thank you for sharing the smaller model with community.

My question: are there any plans for a blog post with an interactive curated demo of the large model, something akin to the DALL·E one?

Thank you in advance for the response.

No license

Hi! Currently, there is no license applied to this repository. Unfortunately, that means that by default, e.g. copying, modifying and distributing the code is forbidden. If this is intentional, please add a mention to the README about this. Otherwise, I suggest adding an open source license, such as MIT.

inpaint notebook error

The inpaint notebook will throw an error when you try to run the setup where you import the grass image. I believe this is caused by the file not being included with the notebook.
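
If grass.png is not sitting next to your copy of the notebook, you can fetch it from the repository before running the setup cell. A sketch, assuming the file still lives at notebooks/grass.png on the main branch:

import requests

# Fetch the sample image the inpaint notebook expects to find locally.
# Assumption: the file is still at notebooks/grass.png on the main branch.
url = "https://raw.githubusercontent.com/openai/glide-text2im/main/notebooks/grass.png"
with open("grass.png", "wb") as f:
    f.write(requests.get(url).content)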

Training parameters

Hi, I'm trying to reimplement the base model and the super-resolution model. Can you release details of the optimization, e.g. the LR schedule and optimizer parameters?

How to get the results closer to what is shown in the paper?

Really inspirational work guys!

But the results from the published code and models are not even remotely comparable to the results shown in the paper. Is there anything we can do to get closer to the original work?

  • E.g. could we train on different (maybe bigger and more diverse) dataset?
  • Or do we need a bigger model?
  • Or maybe tweaking the params a bit could help?

Image from the paper for: "a surrealist dream-like oil painting by Salvador Dalí of a cat playing checkers"

[image from the paper]

Image from the code for the same text prompt "a surrealist dream-like oil painting by salvador..."

[image generated by the released code]

It's almost like that meme: "You vs. the guy she told you not to worry about" 🤣

Anyway, if you can give us some advice on this matter it would be greatly appreciated! 👍

Unfiltered data

Hello,

In order to see how the model would do on people and similar subjects, is it possible to get an unfiltered pre-trained model, or a way to train one myself?
Is it possible to make the original dataset (filtered and unfiltered) available too?

Thank you very much for that!

Creating non-rectangular masks for inpainting

In the paper, I see masks that are non-rectangular (white blob in the sky in the image below):

[image from the paper showing a non-rectangular mask]

but I think the 'mask' in the inpaint notebook is being applied on this line:

source_mask_64[:, :, 20:] = 0

which produces an image with a gray rectangle. Is there an example of how to create more complex masks?
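
The notebook's source_mask_64[:, :, 20:] = 0 only zeroes out a rectangular band, but nothing requires the mask to be rectangular: any tensor of 1s (pixels to keep) and 0s (region to fill) of the right shape works. A sketch of building the 64x64 and 256x256 masks from an arbitrary black-and-white image (the file name is a placeholder, and the 1-keep/0-fill convention mirrors the notebook's rectangular example):

import numpy as np
import torch as th
from PIL import Image

def load_mask(path: str, size: int, device) -> th.Tensor:
    """Turn a black-and-white image into an inpainting mask of shape [1, 1, size, size].

    Convention sketched here: white (1) = keep the source pixel, black (0) = region to fill.
    """
    mask = Image.open(path).convert("L").resize((size, size), Image.NEAREST)
    arr = (np.array(mask) > 127).astype(np.float32)  # white -> 1 (keep), black -> 0 (fill)
    return th.from_numpy(arr)[None, None].to(device)

# source_mask_64 = load_mask("my_blob_mask.png", 64, device)    # placeholder path
# source_mask_256 = load_mask("my_blob_mask.png", 256, device)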

How could I load a mask generated by myself?

At first, I would like to say this is amazing work. But when I tried to change the code in 'inpaint.py' to use my own mask dataset, I realized it is not easy, because the mask is defined by just a few lines of code. So I want to ask how to use my own mask dataset, e.g. masks generated by PCov.
Thanks a lot.

CUDA out of memory

I get an OOM when loading the upsample model:

options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))

the allocation error was

RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 3.94 GiB total capacity; 3.00 GiB already allocated; 30.94 MiB free; 3.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My nvidia-smi output is:

loreto@ombromanto:~/Projects/glide-text2im$ nvidia-smi
Wed Dec 22 20:39:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 45%   23C    P5    N/A /  75W |   3994MiB /  4033MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1094      G   /usr/lib/xorg/Xorg                121MiB |
|    0   N/A  N/A      1926      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A      3532      G   ...AAAAAAAA== --shared-files       22MiB |
|    0   N/A  N/A      4795      C   /usr/bin/python                  3819MiB |
+-----------------------------------------------------------------------------+
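
On a 4 GB card the base model is still resident when the upsampler is loaded, which is enough to tip it over. A sketch of freeing the base model (after the 64x64 samples have been generated and moved off the GPU) before creating the upsampler:

import gc
import torch as th

# Free the base model before loading the upsampler on a small GPU.
samples = samples.detach().cpu()   # keep the 64x64 results on the CPU
del model, diffusion               # drop references to the base model
gc.collect()
th.cuda.empty_cache()              # return the freed blocks to the driver

# ...now create and load model_up as in the notebook, and move `samples`
# back to the device (samples.to(device)) when building model_kwargs.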

Ways to reduce number of failed inpaints?

When I use the model for inpainting, there is a large chance that an object will fail to inpaint, and instead GLIDE will simply guess at the background without inserting the object into the masked area. This is much more likely when the mask box is small, but is still common on large boxes too.

Here is an example:
Mask: [example_3_mask_0]
Inpaint: [example_3_img_0]

My pipeline is to crop a 256x256 box with the mask as close to the center as possible. Then I downsample that, inpaint, run upsampler, and replace the 256x256 box.

Is there any procedure I should use or parameter I should tune to reduce the number of misses?
Thanks.

Usage questions

When using the batch feature how do you save the 256 upsample?
How do you create 384 or 512px images instead of 256?
How do you make more surrealistic images?
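
For the first question, one way to save each upsampled image in a batch individually (rather than the single horizontal strip that show_images renders) is sketched below; the prefix is a placeholder:

from PIL import Image
import torch as th

def save_batch(batch: th.Tensor, prefix: str):
    """Save each image of a [-1, 1] batch of shape (N, 3, H, W) as its own PNG."""
    scaled = ((batch + 1) * 127.5).round().clamp(0, 255).to(th.uint8).cpu()
    for i, img in enumerate(scaled):
        Image.fromarray(img.permute(1, 2, 0).numpy()).save(f"{prefix}_{i:03d}.png")

# save_batch(up_samples, "upsample_256")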

Request: Training script

Hi, I wonder if you could provide a training script to train on our own custom datasets. Thank you.

GUIDE: Using GLIDE in pipenv instead of pip

I've created a guide for people who have moved on to the more advanced pipenv instead of basic pip.

  1. Install Pipenv and Pyenv on your system so that both tools are available. Pyenv is optional but is required for fetching the correct version of Python on-demand. But if you're reading this, you're probably a chad that has both tools already!

  2. Create a new, empty project folder. Then navigate to it in a terminal window.

  3. Run the following command to initialize a pipenv virtualenv with Python 3.9 (that's, as of this writing, the newest version that PyTorch supports).

pipenv install --python=3.9
  4. Now you need to choose a command based on what GPU you have. Note that the command will take a VERY long time to install, because it will download multiple versions of the 1.7 GB large PyTorch archive (for me it took 30 minutes on a 100mbit connection, and downloaded 16 GB). This is PyTorch's fault for using a non-standard repository format and a non-standard package versioning system (the "+cu113" junk), which means that Pipenv has trouble figuring out what to do (since Pipenv only follows Python PEP standards for how repositories should look), so it grabs everything that matches the query. Which is all architectures... (If you're on Linux, just check du -h ~/.cache/pipenv and you'll see that it's downloading gigabytes of packages...)
  • If you use an Ampere NVIDIA GPU (RTX 30x0 series or newer), or your CUDA version is 11.x or newer (check with nvidia-smi in a terminal) then use this repository (which uses CUDA Toolkit 11.3), since those modern GPUs require that applications be based on CUDA Toolkit 11.x.
pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.10.1+cu113"
  • If you use an older NVIDIA GPU (before Ampere), you may have to use this repository instead (which uses CUDA Toolkit 10.2):
pipenv install --extra-index-url https://download.pytorch.org/whl/ "torch==1.10.1+cu102"
  • When these versions are updated by PyTorch in the future, you'll have to open the repo URLs in a web browser and see if a newer version is available, and then use its exact name including the +cu### part... To find out the latest CUDA toolkits they're offering support for, go to https://pytorch.org/ and look at the CUDA versions they offer, and then modify the repo URLs above accordingly (/whl/ is the standard "old/stable" CUDA Toolkit, and /whl/cu113/ is the currently newest version 11.3 but it will change later). I really wish that PyTorch could properly name their packages and set up proper repos instead, but they don't, so we're stuck with this solution of specifying exact versions and repo paths. If we don't include +cu### in the version, then pipenv downloads older CUDA toolkit versions of the packages instead, so beware and be careful.

  • Also note that if you ever see a PyTorch repo URL that ends in <something>.html then you need to delete that HTML filename because that's PyTorch's oldest, pip-only repo format which is a HTML document that mashes together all versions of the packages (CUDA, CPU-only, etc) which makes it completely impossible for Pipenv to even figure out which architecture to download... PyTorch implemented the newer PEP 503 indexes, but only on URLs that don't point at any HTML page..

  • If someone doesn't have a NVIDIA CUDA GPU, there are versions with +cpu in the name in the /whl/ repository, such as:

pipenv install --extra-index-url https://download.pytorch.org/whl/ "torch==1.10.1+cpu"
  5. You also need to install Numpy and PyYAML:
pipenv install numpy pyyaml
  6. Next, clone the GLIDE library repo into a subfolder:
git clone https://github.com/openai/glide-text2im.git lib/glide-text2im
  7. Tell Pipenv to install the "local library" (GLIDE). This will automatically detect your Pipfile in the parent folder and will add it to your Pipfile too. Note that this command must be run from the directory that the Pipfile is in, because it will treat the -e <path> as a relative path from the current working dir. I suppose you could provide a full, absolute path. But we'll do relative here. Oh and this command takes a while since it downloads dependencies.
pipenv install -e ./lib/glide-text2im
  8. Create a subfolder for your own Python project files:
mkdir src && cd src
  9. Now simply create your Python files in src/, import the library as shown in GLIDE's examples, and have fun. You can use the code from the notebook example files that GLIDE provides.

  10. Running your code in the pipenv (virtualenv) must be done with a special command, so that it loads the Python version and virtualenv libraries that you've installed:

pipenv run python <yoursourcefilehere.py>
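
As a quick sanity check that the virtualenv picked up a CUDA-enabled torch build, you can save something like the following as check_gpu.py (name is just a suggestion) and run it with pipenv run python check_gpu.py:

# check_gpu.py -- verify that the pipenv environment has a CUDA-enabled torch.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))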

Why does clip_guided work better than text2im, inconsistent with the paper's claim?

I have tested the glide model for a few days (I tried many kinds of prompts), and my result is that clip_guided works better than classifier-free text2im.

clip_guided can correctly follow the intention of my prompt, like "a boat on the top of the mountain", or "Pablo Picasso: Into the wind", and text2im failed to do that.

However the paper claims that classifier-free text2im > clip_guided. I wonder why? Is there anything wrong with the released model?

Larger batch size to generate images in text2im.ipynb?

Hi, in the example notebook text2im.ipynb, I'm not clear on how to use a batch size larger than 1, or on the recommended way to generate many images.

I'd like to play around with the model and generate several thousand images for some captions I have collected, and evaluate the overall quality of the results. However, I'm not clear on the best way to do this other than something along the lines of the pseudo-code below:

for each caption in my dataset:
      tokens = encode(caption)
      model_kwargs = {...}
      sample =  diffusion.p_sample_loop(...)
      save_sample(sample)

Would there be a faster way to do this than (more/less) following the recipe above?
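
The notebook already tiles the conditional and unconditional tokens batch_size times, so generating several images per prompt mostly comes down to raising batch_size (within GPU memory) and keeping the two halves paired. A sketch reusing the variable names from the text2im snippet quoted earlier on this page (tokens, uncond_tokens, mask, uncond_mask, model_fn, options, diffusion, device), so it is a fragment meant to be dropped into that code rather than a standalone script:

# Sketch: sample 8 images per prompt by raising batch_size; the kwargs
# below simply repeat the (un)conditional tokens to match.
batch_size = 8                      # as many as fits in GPU memory
full_batch_size = batch_size * 2    # conditional + unconditional halves

model_kwargs = dict(
    tokens=th.tensor([tokens] * batch_size + [uncond_tokens] * batch_size, device=device),
    mask=th.tensor([mask] * batch_size + [uncond_mask] * batch_size, dtype=th.bool, device=device),
)

samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device, clip_denoised=True, progress=True,
    model_kwargs=model_kwargs, cond_fn=None,
)[:batch_size]                      # keep only the conditional half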
