This is my project on language models for legal text generation.
You can build the pandas DataFrame from the bulk data provided by CourtListener. To generate your own dataset, see the 'create_dataset.py' file. You can adjust the dataset size by setting a different k value; I used 50,000, as mentioned in the paper.
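As a rough illustration of what choosing k means, here is a minimal sketch that draws k rows uniformly at random from a large CSV stream without loading it all into memory (reservoir sampling). It is not the code in 'create_dataset.py'; the function name `sample_rows`, the `seed` parameter, and the `id`/`text` columns are assumptions made for the example.

```python
import csv
import io
import random


def sample_rows(csv_file, k, seed=0):
    """Return up to k rows drawn uniformly at random from a CSV stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(csv.DictReader(csv_file)):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Tiny in-memory stand-in for a bulk data file (hypothetical columns).
demo = io.StringIO(
    "id,text\n" + "".join(f"{i},opinion {i}\n" for i in range(100))
)
rows = sample_rows(demo, k=10)
print(len(rows))
```

The returned list of dictionaries can be passed directly to `pandas.DataFrame(rows)` to obtain a DataFrame.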
To clean the data, use the 'preprocessing.py' file.
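The exact cleaning steps are defined in 'preprocessing.py'. As a hedged sketch of what such a step might look like, the snippet below assumes the raw opinion text contains HTML markup and irregular whitespace; the function name `clean_opinion` and the chosen steps are illustrative assumptions, not the repository's actual pipeline.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")       # runs of whitespace, including newlines


def clean_opinion(raw):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)    # replace tags with a space
    text = html.unescape(text)     # e.g. "&amp;" becomes "&"
    return WS_RE.sub(" ", text).strip()


print(clean_opinion("<p>The  court&nbsp;held</p>\n<p>that &amp; so on.</p>"))
```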
To run sentence boundary detection, use the 'embedding.py' file.
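Sentence boundary detection is harder for legal text than for ordinary prose because citations are full of periods ("v.", "U.S.", "No."). The sketch below is a simple rule-based baseline that shows the problem; it is not the method implemented in 'embedding.py'. The abbreviation list and the function name `split_sentences` are assumptions made for the example.

```python
# Hypothetical, deliberately short abbreviation list; real legal text needs more.
ABBREVIATIONS = {"v.", "U.S.", "No.", "Inc.", "Co.", "Corp.", "Ct.", "Cir."}


def split_sentences(text):
    """Split on ., ?, ! unless the token is a known abbreviation."""
    tokens = text.split()
    sentences, current = [], []
    for idx, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[idx + 1] if idx + 1 < len(tokens) else ""
        ends_sentence = (
            token.endswith((".", "?", "!"))
            and token not in ABBREVIATIONS
            # Only split if the text ends or the next word is capitalized.
            and (not next_token or next_token[0].isupper())
        )
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks final punctuation.
        sentences.append(" ".join(current))
    return sentences


example = "The court cited Roe v. Wade in its ruling. The appeal was denied."
for sentence in split_sentences(example):
    print(sentence)
```

A baseline like this still fails on abbreviations missing from the list, which is one reason to prefer a learned approach.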
See the 'fine_tuning.py' file to fine-tune your model.
To reproduce the plots in the paper, use the 'plots.py' file.