mp2893 / med2vec

Repository for Med2Vec project

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

med2vec's Introduction

Med2Vec

Med2Vec is a multi-layer representation learning tool for learning code representations and visit representations from EHR datasets.

Med2Vec embeddings not only help improve the predictive performance of healthcare applications, but also enable the interpretation of the learned code representations in a coordinate-wise manner. You can see that these six coordinates (chosen by their strong correlation with patient severity level) of the code representation space demonstrate medically coherent groups of symptoms (diagnoses, medications, and procedures).

Relevant Publications

Med2Vec implements an algorithm introduced in the following paper:

Multi-layer Representation Learning for Medical Concepts
Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, 
Michael Thompson, James Bost, Javier Tejedor-Sojo, Jimeng Sun
KDD 2016, pp.1495-1504

Running Med2Vec

STEP 1: Installation

Install Python and Theano. We use Python 2.7 and Theano 0.7. Theano can be easily installed on Ubuntu as suggested here
If you plan to use GPU computation, install CUDA
Download/clone the Med2Vec code

STEP 2: Fast way to test Med2Vec with MIMIC-III

This step describes how to run Med2Vec on MIMIC-III in as few steps as possible.

You will first need to request access to MIMIC-III, a publicly available electronic health records dataset collected from ICU patients over 11 years.
You can use "process_mimic.py" to process the MIMIC-III dataset and generate a suitable training dataset for Med2Vec. Place the script in the same directory as the MIMIC-III CSV files and run it:
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file>
Instructions are given inside the script.
Run Med2Vec using the ".seqs" file generated by process_mimic.py, with the following command:
python med2vec.py <seqs file> 4894 <output path>
where 4894 is the number of unique ICD9 diagnosis codes in the dataset. As described in the paper, however, it is a good idea to use the grouped codes for training the Softmax component of Med2Vec. Therefore we recommend using the following command instead:
python med2vec.py <seqs file> 4894 <output path> --label_file <3digitICD9.seqs file> --n_output_codes 942
where 942 is the number of unique 3-digit ICD9 diagnosis codes in the dataset. If you are interested in learning the representation of 3-digit ICD9 codes only, you can also start from the ".3digitICD9.seqs" file, using the following command:
python med2vec.py <3digitICD9.seqs file> 942 <output path>
As suggested in STEP 4, you might want to adjust the hyper-parameters. I recommend decreasing the --batch_size to 100 or so, since the default value 1,000 is too big considering the small number of patients in MIMIC-III datasets. There are only 7,500 patients who made more than a single visit, and most of them have only two visits.
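Before choosing --batch_size, it can help to check how many patients and visits a training file actually contains. The following is a minimal sketch, assuming the visit-file layout described in STEP 3 (a list of visits in which [-1] separates patients). The function name summarize_visits is hypothetical and not part of the Med2Vec code.

```python
def summarize_visits(seqs):
    """Count patients, visits, and multi-visit patients in a visit list.

    `seqs` is a list of visits (lists of integer codes) in which the
    single-element list [-1] separates consecutive patients.
    """
    visits_per_patient = []
    current = 0
    for visit in seqs:
        if visit == [-1]:
            visits_per_patient.append(current)
            current = 0
        else:
            current += 1
    # The last patient is not followed by a delimiter.
    if current > 0:
        visits_per_patient.append(current)
    return {
        "patients": len(visits_per_patient),
        "visits": sum(visits_per_patient),
        "multi_visit_patients": sum(1 for n in visits_per_patient if n > 1),
    }


example = [[1, 2, 3], [4, 5, 6, 7], [-1], [2, 4], [8, 3, 1], [3]]
stats = summarize_visits(example)
print(stats)
```

For a real file, the list would first be loaded with pickle and then passed to the function.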

STEP 3: Preparing training data

1. Med2Vec training data needs to be a Python pickled list of lists of medical codes (e.g. diagnosis codes, medication codes, or procedure codes). First, each medical code needs to be converted to an integer. A single visit can then be represented as a list of integers. For example, [5,8,15] means the patient was assigned codes 5, 8, and 15 at a certain visit. If a patient made two visits [1,2,3] and [4,5,6,7], they can be converted to a list of lists [[1,2,3], [4,5,6,7]]. If there are multiple patients, consecutive patients must be separated by the list [-1]. For example, [[1,2,3], [4,5,6,7], [-1], [2,4], [8,3,1], [3]] means there are two patients, where the first patient made two visits and the second patient made three visits. This list of lists needs to be pickled using cPickle. We will refer to this file as the "visit file".
2. The total number of unique medical codes is required to run Med2Vec. For example, if the dataset uses 14,000 diagnosis codes and 11,000 procedure codes, the total number is 25,000. Note that using a huge number of codes could lead to memory problems, depending on your RAM/VRAM (thanks for the tip, tRosenflanz).
3. For faster training, you can provide an additional dataset, which is simply the same dataset as in step 1, but with grouped medical codes. For example, ICD9 diagnosis codes can be grouped into 283 categories by using CCS groupers. You will still be able to learn the code representations for the original un-grouped codes. The grouped dataset is used only to speed up training (refer to section 4.4 of the paper). The grouped dataset should be prepared in the same way as the dataset in step 1. We will refer to this grouped dataset as the "label file".
4. As in step 2, you will need to remember the total number of unique grouped codes if you plan to use this grouped dataset.
5. If you wish to use patient demographic information (e.g. age, weight, gender), you need to create a demographics vector for each visit the patient made. For example, if you are using age (real-valued) and ethnicity (categorical, assume 6 categories), you can create a vector such as [45.0, 0, 0, 0, 0, 1, 0]. Similar to the [-1] list in step 1, patients are separated by an all-zero vector. The demographic information will therefore be a pickled matrix whose column size is the size of the demographics vector and whose row size is the total number of visits of all patients plus the delimiters. We will refer to this file as the "demo file".
6. Similar to step 2, you will need to remember the size of the demographics vector if you plan to use the demo file. In the example of step 5, the size of the demographics vector is 7.
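The visit file and the label file described above can be sketched as follows. This is an illustration under stated assumptions, not the project's preprocessing script. The input layout (a list of patients, each a list of visits of code strings), the helper names build_visit_file and icd9_3digit, and the toy codes are all hypothetical. The grouping rule assumes undotted ICD-9 strings in which E-codes keep four characters. The README mentions cPickle, which is the Python 2 name; under Python 3 the module is pickle.

```python
import os
import pickle
import tempfile


def build_visit_file(patients, group=None):
    """Convert per-patient visits of raw code strings into Med2Vec's list format.

    patients: list of patients, each a list of visits, each a list of code strings.
    group:    optional function mapping a raw code to a grouped code (label file).
    Returns (seqs, code_to_int).
    """
    code_to_int = {}
    seqs = []
    for i, visits in enumerate(patients):
        if i > 0:
            seqs.append([-1])  # delimiter between patients
        for visit in visits:
            ints = []
            for code in visit:
                key = group(code) if group else code
                if key not in code_to_int:
                    code_to_int[key] = len(code_to_int)
                idx = code_to_int[key]
                if idx not in ints:  # grouped codes may collide within a visit
                    ints.append(idx)
            seqs.append(ints)
    return seqs, code_to_int


def icd9_3digit(code):
    # Assumption: undotted ICD-9 strings; E-codes keep four characters.
    return code[:4] if code.startswith("E") else code[:3]


patients = [
    [["4019", "25000"], ["4280", "4019"]],   # first patient, two visits
    [["E8490", "25001"], ["25000"]],         # second patient, two visits
]
visit_seqs, visit_types = build_visit_file(patients)
label_seqs, label_types = build_visit_file(patients, group=icd9_3digit)

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "example.seqs"), "wb") as f:
    pickle.dump(visit_seqs, f, protocol=2)  # protocol 2 is readable from Python 2

print(visit_seqs, len(visit_types))
print(label_seqs, len(label_types))
```

The lengths of the two dictionaries are the numbers of unique codes and unique grouped codes that steps 2 and 4 ask you to remember.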

STEP 4: Running Med2Vec

The minimum input you need to run Med2Vec is the visit file, the number of unique medical codes, and the output path: python med2vec.py <path/to/visit_file> <the number of unique medical codes> <path/to/output>
Specifying the --verbose option prints the training progress after every 10 mini-batches.
Additional options can be specified, such as the size of the code representation, the size of the visit representation, and the number of epochs. Detailed information can be accessed with python med2vec.py --help

STEP 5: Looking at your results

Med2Vec produces a model file after each epoch. The model file is generated by numpy.savez_compressed.
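Because the model file is a compressed NumPy archive, it can be opened with numpy.load. The sketch below writes a tiny stand-in archive first so that it runs on its own. The path and array shapes are made up, and the array names W_emb and b_emb are taken from what users report finding in the output files rather than from the Med2Vec code itself.

```python
import os
import tempfile

import numpy as np

# Stand-in for a Med2Vec model file: tiny arrays saved the same way.
path = os.path.join(tempfile.mkdtemp(), "model_example.npz")
np.savez_compressed(path, W_emb=np.zeros((8, 4)), b_emb=np.zeros(4))

# List every array stored in the archive together with its shape.
with np.load(path) as model:
    shapes = {name: model[name].shape for name in model.files}

for name, shape in sorted(shapes.items()):
    print(name, shape)
```

To inspect a real model, replace path with the file written for the epoch of interest.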

The 2D scatterplot of the learned code representations would look similar to this. (This is the scatterplot of the code representations trained with Non-negative Skip-gram, which is essentially Med2Vec minus the visit-level training)

med2vec's Issues

high training cost

Hi Edward,
While I was searching for a new research idea, I've found your model and it was interesting in that it can learn code- and visit-level representation from EHRs simultaneously.

Using your model, I'm trying to learn embeddings that can represent measurements other than medical codes. However, the training cost seems quite high (around 150~250) and it doesn't converge (or go below 1) the way other models do. I've found that other users report the same range of cost, but I wonder what the final cost was at the end of your training.

Is it natural for this model to have this kind of high cost at the end of training, or is something wrong with a setting? I've adjusted the parameters of the model, but a cost of 170 was the best I could get.

I would appreciate your help.

Questions about experiments

Hello, thank you for making your code available.
You mentioned two experiments in your paper, but I don't understand how to run them.
Could you please explain clearly how to run these experiments?
Thank you!

Unable to interpret output of npz model file

Hello Ed,
While testing Med2Vec on the MIMIC database, I was unable to interpret the output of the model file. Are these the model weights or the predicted neighbouring visits?

Please clarify my doubt!
Thank You

Output model/weights?

Hi, I read your paper and wanted to ask if the trained (on MIMIC-III) model/W2V can be downloaded directly - I'd like to see and to try it on our internal medical data before setting up resources to try to train a new model. (And training a new model from scratch would take weeks - It's a common practice to share the base model).

thanks!

Where can I download the datasets described in your paper? Are they all publicly available? (I opened the link provided in your paper but could not find a download link)

Where can I download the datasets described in your paper? Are they all publicly available?

Negative Visit Forward Cross-Entropy on MIMIC-III

First of all, thank you for making your code available. This is a very interesting line of research.

In order to better understand your work, I rewrote med2vec in Python with TensorFlow rather than Theano. I then compared my results to yours on the MIMIC-III data set, with the same parameters used, expecting the results to be close. However, I discovered some issues:

In the calculation of visit forward cross-entropy, there are negative values. This leads to some cancellations and hence a visit cost of 300-400. Should negative values be considered here? In Tensorflow, the negative values are mapped to 0, giving a visit cost of ~4,000. What sort of cost values have you seen on MIMIC-III and other data sets?
My emb cost values are roughly equal to yours, but the value is < 10. Since visit cost >> emb cost, won't total cost just optimize for visits and not emb?

Below are some images related to visit cost calculation.

I'd be happy to make my code available to you, if you like. Thank you for your continued work on medical data analytics and I look forward to hearing back from you.

interesting topic

It would be quite interesting if this approach could be used to match ICD codes across different language versions, or to do a sort of machine translation of medical concepts based on the term representations.

Where I can find the AHFS classification table?

Hello Choi,

As you mentioned in the paper, you are using the AHFS classification to group the NDC codes. I wonder if you still have that mapping table? (Or could you point me to a way to find it?)
I am doing related work but can't find the table anywhere online.

Thank you!

Scatter plot from learned code representations

Hello Ed,

In Med2Vec, after creating the model file, you created a 2D scatter plot from the learned code representations. Is any grouping performed on the medical codes after creating the model file, for the scatter plot?

I ask because in Highcharts the coloring is done based on some grouping.
Example:
https://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/master/samples/highcharts/demo/scatter/

I have tried to create a scatter plot after performing t-SNE on the embeddings. The plot is created, but there is no grouping: the colors are randomly placed and no clusters are formed.

Can you please help me in understanding this?

Thanks,
SathickIbrahim

Cost and Weights are NAN

Hello,

During training the cost goes to NaN, probably because one of the weights becomes too large and the data goes out of the bounds of float32. This causes all other weights to become NaN as well. I think the classic way to deal with this is to add Batch Normalization layers, which clip large updates to the weights; however, my limited understanding of Theano and your script prevents me from testing it out. Also, the cost seems quite high. Have you seen similar values in your training? Let me know your thoughts on this:

Visit representation evaluation results on MIMIC-III

Hello Choi, thanks for sharing the code on GitHub. It is a great topic.

After reading several of your papers, I have a few questions:

Do you have visit representation evaluation results on MIMIC-III? Compared with your GRAM model, which one performs better? (I ask because on CHOA the recall@30 is around 76%, while in the GRAM paper the accuracy@20 on MIMIC-III is relatively low, around 30% on average.)
When you learn the vector representations of medical concepts, you want these vectors to eventually lie in the same common space. But does it make sense to treat them as sharing a common space in the first place? For example, you build one dictionary for procedure codes, diagnosis codes and medication codes, and then use a single one-hot vector for all these codes.

Thanks

GPU training fails

I am getting an error when trying to do the training on the GPU. There are 47108 unique codes (quite a lot more than in the Mimic) but I am still getting an error even if I am using code and visit representations of just 5 dimensions and batch size of just 1 so I don't believe it is an out of memory problem. That is of course if my math is right:
2 Dense vectors of 47108 doubles: just 1 mb
47108x5 matrix for Dense to Code representation: 2 mb
5x5 matrix for Code representation to Visit: 200 bytes

Any help will be appreciated!

Using gpu device 0: Tesla P100-SXM2-16GB (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
initializing parameters
building models
loading data
training start
[[ 0. 0. 0. ..., 0. 0. 0.]]
Traceback (most recent call last):
  File "med2vec.py", line 323, in <module>
    train_med2vec(seqFile=args.seq_file, demoFile=args.demo_file, labelFile=args.label_file, outFile=args.out_file, numXcodes=args.n_input_codes, numYcodes=args.n_output_codes, embDimSize=args.cr_size, hiddenDimSize=args.vr_size, batchSize=args.batch_size, maxEpochs=args.n_epoch, L2_reg=args.L2_reg, demoSize=args.demo_size, windowSize=args.window_size, logEps=args.log_eps, verbose=args.verbose)
  File "med2vec.py", line 290, in train_med2vec
    cost = f_grad_shared(x, mask, iVector, jVector)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else
RuntimeError: Cuda error: GpuElemwise node_ea5fafbdcfd074e674342684e5c33a10_0 Exp: an illegal memory access was encountered.
n_blocks=30 threads_per_block=256
Call: kernel_Exp_node_ea5fafbdcfd074e674342684e5c33a10_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, i0_data, o0_data)
Apply node that caused the error: GpuElemwise{Exp}(0, 0)
Toposort index: 35
Inputs types: [CudaNdarrayType(float32, matrix)]
Inputs shapes: [(47108, 47108)]
Inputs strides: [(47108, 1)]
Inputs values: ['not shown']
Outputs clients: [[GpuCAReduce{add}{0,1}(GpuElemwise{Exp}[(0, 0)].0), GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,x}.0, GpuElemwise{Exp}[(0, 0)].0)]]

Edit: I also checked that 47108 codes do not cause problems by using only 10 codes and adding x[idx][np.array(seq)%numXcodes] = 1. in padMatrix
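The traceback reports that the failing Exp node operates on a float32 matrix of shape (47108, 47108), which is much larger than the vectors and matrices in the estimate above. The arithmetic below only sizes such a matrix; the helper name dense_matrix_bytes is hypothetical, and nothing here establishes that the size is what triggers the CUDA error.

```python
def dense_matrix_bytes(rows, cols, bytes_per_element=4):
    """Memory needed for a dense rows x cols matrix (float32 by default)."""
    return rows * cols * bytes_per_element


n_codes = 47108
elements = n_codes * n_codes
size_gib = dense_matrix_bytes(n_codes, n_codes) / 2**30

print(elements)                # 2219163664
print(round(size_gib, 2))      # 8.27
print(elements > 2**31 - 1)    # True
```

A single matrix of this shape takes about 8.27 GiB, and its element count is larger than the maximum signed 32-bit integer. Either property could matter on a 16 GB card, but that is an observation rather than a diagnosis.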

How to make demo.txt

Hi Ed,

I appreciate the code you provided, but there is one point that confuses me.

Regarding Step 3.5, could you please give some details about how to make 'demo.txt', and add the code that creates it to process_mimic.py?

Thanks!
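One way to assemble a demographics matrix in the layout the README describes (one row per visit, an all-zero row between patients) is sketched below. It is not taken from process_mimic.py. The input layout, the helper names demo_vector and build_demo_matrix, and the choice of age plus a 6-category ethnicity one-hot are assumptions that follow the README's example vector.

```python
import numpy as np

N_ETHNICITY = 6  # assumed number of categories, as in the README example


def demo_vector(age, ethnicity_index):
    """Age (real-valued) followed by a one-hot ethnicity encoding."""
    vec = np.zeros(1 + N_ETHNICITY, dtype=np.float32)
    vec[0] = age
    vec[1 + ethnicity_index] = 1.0
    return vec


def build_demo_matrix(patients):
    """patients: list of patients, each a list of (age, ethnicity_index) per visit."""
    rows = []
    for i, visits in enumerate(patients):
        if i > 0:
            # All-zero delimiter row, mirroring [-1] in the visit file.
            rows.append(np.zeros(1 + N_ETHNICITY, dtype=np.float32))
        rows.extend(demo_vector(age, eth) for age, eth in visits)
    return np.vstack(rows)


demo = build_demo_matrix([
    [(45.0, 4), (46.0, 4)],              # first patient, two visits
    [(30.0, 0), (30.5, 0), (31.0, 0)],   # second patient, three visits
])
print(demo.shape)
print(demo[0])
```

The rows must follow the same patient and visit order as the visit file, and the resulting matrix would then be pickled in the same way. The demographics vector size in this sketch is 7.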

output file

Hi Edward,
I have run your code on my dataset and got the .npz files. I find it contains 6 numpy.array variables W_emb b_output b_hidden b_emb W_output W_hidden. But on the GitHub Repo I can’t find further description about these output variables. Can you give some detailed instruction about the output? How can I get the code and visit representation?

TypeError: Expected Variable, got odict values

Hi, thank you for the awesome paper. I've been very interested in getting med2vec up and running using just the 3 required parameters for starters.

With my pickled sequence list looking like: [[1,2,3], [4,5,6,7], [-1], [2,4], [8,3,1], [3]]
I get the following error:

File "test_med2vec.py", line 248, in train_med2vec
grads = T.grad(cost, wrt=tparams.values())

TypeError: Expected Variable, got odict_values([W_emb, b_emb, W_hidden, b_hidden, W_output, b_output]) of type <class 'odict_values'>

Here's my run syntax:

python3 test_med2vec.py 'seq.pkl' 8 'med2vec_fin'

Any idea why it's not liking the dictionary values? Thank you for your time, if you're able to help.
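The type named in the error, odict_values, is what OrderedDict.values() returns under Python 3, whereas the README targets Python 2.7, where the same call returns a plain list. The snippet below only demonstrates that difference with a stand-in dictionary; the parameter values are placeholders, not Theano variables.

```python
from collections import OrderedDict

# Stand-in for the parameter dictionary; real entries would be Theano shared variables.
tparams = OrderedDict([("W_emb", "w"), ("b_emb", "b")])

values_view = tparams.values()
print(type(values_view).__name__)   # 'odict_values' on Python 3

# Converting the view to a list gives functions that expect a list
# (such as the `wrt` argument in the traceback) a concrete sequence.
param_list = list(values_view)
print(param_list)
```

Under that assumption, a possible workaround (not confirmed by the author) is to pass list(tparams.values()) as the wrt argument. Other Python 3 incompatibilities may remain in a script written for Python 2.7.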

questions about the training data format

Hi there, thanks a lot for making the code available. It helps me a lot in understanding your paper.
I have a question about the format of the training data. In README.md, step 3, when describing how to prepare the training data, each visit is said to be represented by a list of integers, such as [5,8,13]. In the closed issue "output file" (#7), you answered TheodoreZhao's question and said that "For visit representation, you can derive the code-level representation using u_t = ReLU(W_emb x_t + b_emb), possibly with a multi-hot vector, then use v_t = ReLU(W_hidden u_t + b_hidden) to derive the visit representation".

So my question is: if I represent a visit with a list of integers in the training step, and compute a visit representation based on a multi-hot vector, what is the relationship between the list and the multi-hot vector? How do I get the multi-hot vector for each visit?

thanks.
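The relationship asked about can be illustrated as follows: the multi-hot vector has one entry per code and is set to 1 at the positions listed in the visit. The sketch then applies the two formulas quoted above. It assumes that W_emb is stored with one row per code and W_hidden with one row per code-representation coordinate, and that no demographics are used. The function names and the random toy parameters are hypothetical.

```python
import numpy as np


def multi_hot(visit, n_codes):
    """Turn a visit (list of integer codes) into a 0/1 vector of length n_codes."""
    x = np.zeros(n_codes, dtype=np.float32)
    x[visit] = 1.0
    return x


def relu(z):
    return np.maximum(z, 0.0)


def visit_representation(visit, W_emb, b_emb, W_hidden, b_hidden):
    """u_t = ReLU(x_t W_emb + b_emb), v_t = ReLU(u_t W_hidden + b_hidden)."""
    x = multi_hot(visit, W_emb.shape[0])
    u = relu(x @ W_emb + b_emb)
    return relu(u @ W_hidden + b_hidden)


# Toy parameters: 16 codes, 4-dimensional code space, 3-dimensional visit space.
rng = np.random.default_rng(0)
W_emb, b_emb = rng.uniform(0, 0.1, (16, 4)), np.zeros(4)
W_hidden, b_hidden = rng.uniform(0, 0.1, (4, 3)), np.zeros(3)

x = multi_hot([5, 8, 13], 16)
v = visit_representation([5, 8, 13], W_emb, b_emb, W_hidden, b_hidden)
print(x)
print(v.shape)
```

With real weights, the four arrays would come from the saved model file instead of the random generator.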

Negative Code Embeddings

Hello Ed,

In the Med2Vec code, the weights are initialized in the range (-0.01, 0.01), which generates negative code embeddings:
params['W_emb'] = np.random.uniform(-0.01, 0.01,

However, the paper says that Med2Vec maps "all medical codes C to non-negative real-valued vectors of dimension m".

Can you please help me in understanding that?

Thanks,
Ankit

How to tune parameters to avoid cost:nan?

Using our own EHR data and the default parameters of med2vec, the cost went to NaN in epoch 1. Which parameter should I adjust to avoid this? Should I increase L2 or set a bigger log_eps? We have over 100 thousand batches in total; do we need to set a bigger batch_size?

NaN gradient may be due to weight initialization

Hi Ed,

I saw in your code, the weights are initialized with truncated normal distribution. When I ran it, it seemed in the medical-code-loss part, this produced large values feeding to exp and resulted in inf in the loss and NaN gradients. Also because of such initial weights, the loss in general is pretty high around several hundreds, especially L2 loss is around tens of thousands. Then I changed the weight initialization to be uniform with a small interval [-0.1, 0.1]. That seems to produce reasonable magnitude of loss (under 10). I wonder if you still remember whether you have tried other weight initializations and how they impact the results.

Another question I have is that in the paper, the loss is averaged over T. Is this T visits in the batch or visits per patient? In your code, it seems, your ivec and jvec are generated for the batch. So in the medical-code-loss calculation, it is averaging over all visits in a batch, instead of averaging per patient and then averaging over all patients in a batch?

Thanks!
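The overflow described in this report (large values fed to exp giving inf and then NaN) can be reproduced in isolation. The sketch below contrasts a naive log-softmax with the standard max-subtraction form in float32. It is a generic numerical technique shown for illustration and is not how med2vec.py computes its loss. The function names and scores are made up.

```python
import numpy as np


def log_softmax_naive(scores):
    return np.log(np.exp(scores) / np.exp(scores).sum())


def log_softmax_stable(scores):
    # Subtracting the maximum leaves the softmax unchanged but keeps exp() bounded by 1.
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())


scores = np.array([100.0, 50.0, 0.0], dtype=np.float32)

# Silence the overflow warnings that the naive version triggers on purpose.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = log_softmax_naive(scores)
stable = log_softmax_stable(scores)

print(naive)
print(stable)
```

In float32, exp(100) already overflows, so the naive version yields non-finite values while the shifted version stays finite. Smaller initial weights, as tried in the report, reduce the scores and avoid the same overflow from the other side.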

Epochs and loss during training

Hi Ed,

I am training embedding using your default hyperparameters, except window_size. The minimum number of visits in my dataset is 2, but I set window_size=3 as I suppose your code can handle the inconsistency between window_size and actual sequence length. Am I right?

I also noticed that the mean_cost was at its minimum at the 2nd epoch and then started increasing. Although I read in your paper that the number of epochs does not hurt the code representations very much, I am not sure which epoch I should choose once training has finished. Should I use the one with the minimum cost, or the one from the last epoch?

Interpretation of learned representations

Hello Ed @mp2893,

This is super interesting work! I have two questions regarding the interpretation of learned representations.

In Section 3.5 - Interpreting code representations, the top k medical codes from each embedding dimension are selected to check if they are clinically related. However, by using skip-gram, according to this post, I think we should use cosine similarity to group medical codes but not the magnitudes of the values on specific embedding dimension. I think it is the angle between different medical codes that matter, not the magnitudes of the values on specific dimensions.
I have a similar question regarding interpreting visit representations. Specifically, why is it meaningful to compare the magnitudes of a specific dimension in the visit embedding space?

Thank you very much!
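The two ways of grouping codes contrasted in this question can be put side by side on a toy matrix. The sketch assumes an embedding matrix with one row per code. The function names and values are hypothetical, and it takes no position on which reading is the appropriate one for Med2Vec.

```python
import numpy as np


def top_k_codes_per_dimension(W_emb, dim, k):
    """Indices of the k codes with the largest value on one embedding coordinate."""
    return np.argsort(-W_emb[:, dim])[:k].tolist()


def nearest_by_cosine(W_emb, code, k):
    """Indices of the k codes with the highest cosine similarity to `code`."""
    norms = np.linalg.norm(W_emb, axis=1) + 1e-12  # avoid division by zero
    sims = (W_emb @ W_emb[code]) / (norms * norms[code])
    sims[code] = -np.inf  # exclude the query code itself
    return np.argsort(-sims)[:k].tolist()


# Toy non-negative embedding matrix: 5 codes, 3 coordinates.
W_emb = np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.0, 0.7, 0.2],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 0.5],
])

top_dim0 = top_k_codes_per_dimension(W_emb, dim=0, k=2)
nearest_to_0 = nearest_by_cosine(W_emb, code=0, k=1)
print(top_dim0)
print(nearest_to_0)
```

In this toy case both views agree that codes 0 and 1 belong together. On real embeddings they can diverge, which is the point the question raises.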

Can you give some data for test and detail them?

For example, in [5,8,15], what is code 5 at a certain visit? And 8 and 15? Thanks!

Mapping embeddings to ICD codes

Hi Dr. Choi,

Thank you for sharing your work on Github!

Could you please tell me where I find the mapping between the ICD codes and the embeddings? I was testing med2vec on demo MIMIC data and W_emb is an array of dimensions 4894*200, where 4894 are the unique ICD codes, could you please let me know where I can find the mapping between W_emb and the corresponding ICD code names?

Thanks a lot!
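Since the training data uses integers in place of codes, row i of W_emb corresponds to whichever code was assigned the integer i during preprocessing. The sketch below assumes that the code-to-integer dictionary from preprocessing was kept. The dictionary contents, the toy matrix, and the helper name invert_code_map are hypothetical.

```python
def invert_code_map(code_to_int):
    """Build an index -> code lookup from a code -> index dictionary."""
    int_to_code = {idx: code for code, idx in code_to_int.items()}
    if len(int_to_code) != len(code_to_int):
        raise ValueError("two codes share the same integer index")
    return int_to_code


# Hypothetical mapping saved during preprocessing, and a toy embedding matrix
# whose row i is the vector for the code with index i.
code_to_int = {"4019": 0, "25000": 1, "4280": 2}
W_emb = [[0.1, 0.0], [0.0, 0.3], [0.2, 0.2]]

int_to_code = invert_code_map(code_to_int)
embedding_by_code = {int_to_code[i]: row for i, row in enumerate(W_emb)}
print(embedding_by_code["25000"])
```

If no such dictionary was saved, the mapping has to be regenerated by re-running the preprocessing on the same input in the same order.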

Questions about complexity analysis

Hi Ed,

As mentioned in your paper, "Therefore the complexity of Med2Vec is dominated by the code representation learning process, for which we use the Skip-gram algorithm".

I know you use grouper/parent codes to decrease the complexity of visit-level learning process. But it seems that you didn't do much on the code-level part.

Is there a reason why you do not use methods like negative sampling to decrease the complexity of code level learning process?

Thanks
Xianlong
