tobhey / norbert Goto Github PK

Requirements classification approach using BERT's transfer learning capabilities

License: GNU General Public License v3.0

Jupyter Notebook 100.00%

norbert's Introduction

🔭 Postdoc at Karlsruhe Institute of Technology (KIT)

I am researching in the domain of Natural Language Processing (NLP) for Software Engineering (SE) with focus on traceability link recovery and requirements engineering.

In my dissertation, I developed the automated traceability link recovery approach FTLR, that is able to relate requirements to their corresponding source code entities by utilizing fine-grained word embedding-based relations. Furthermore, I’ve developed the requirements classification approach NoRBERT and integrated its results as a filter into FTLR.

Previously, I’ve done research on programming in natural language, mapping natural language instructions to their corresponding API calls.

norbert's People

Contributors

Stargazers

Watchers

Forkers

geetika016 daesungwang jasonchan117 nharikrishnan keshava surekhamedapati arunasankar asantos2000 khomiakov7706 aegunasekaran yixingluo

norbert's Issues

Question regarding Dataset directory?

Hi @tobhey,

I read the paper, great job!
Basically this is not an issue related to the models or any code provided in the project.
I have a question regarding the dataset directory available on this projects dataset.
Is it the whole dataset used for the project or just the split used for testing the models?

Thanks in advance!

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (207,) + inhomogeneous part.

I can't understand why on suddenly it went wrong in Task2_3_Multiclass_classification_of_NFR_subclasses.ipynb during training on the classifier! If there is any solution please provide me with one. I have uploaded the image of error as well as particular cell for which error was generated.

Image :

Cell : Decide how to fold and train the classifier
Code snippet :

overall_flat_predictions, overall_flat_true_labels, results = [], [], []
initLog()
if config.fold == Fold.TenFold:
  skf = StratifiedKFold(n_splits=10)
  fold_number = 1
  for train, test in skf.split(df, df[config_data.label_column]):
    df_train = df.iloc[train]
    df_eval = df.iloc[test]
    log_text = '/////////////////////// Fold: {} of {} /////////////////////////////'.format(fold_number,10)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
    fold_number = fold_number + 1
elif config.fold == Fold.ProjFold:     
  for k in config_data.project_fold:
    test = df.loc[df['ProjectID'].isin(k)].index
    train = df.loc[~df['ProjectID'].isin(k)].index
    df_train = df.loc[train]
    df_eval = df.loc[test]
    log_text = '/////////////////////// Test-Projects: {} /////////////////////////////'.format(k)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
else:
  df_train, df_eval = train_test_split(df,stratify=df[config_data.label_column], train_size=config.train_size, random_state= config.seed)
  classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)

get_memory_usage_str()

Error :

Train Dataframe shape: (332, 18)
Evaluation Dataframe shape: (37, 18)
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-21-b2bf3dae31b4>](https://localhost:8080/#) in <cell line: 29>()
     35     log_text = '/////////////////////// Fold: {} of {} /////////////////////////////'.format(fold_number,10)
     36     logLine(log_text)
---> 37     classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
     38     fold_number = fold_number + 1
     39 elif config.fold == Fold.ProjFold:

10 frames
[/usr/local/lib/python3.10/dist-packages/fastai/core.py](https://localhost:8080/#) in array(a, dtype, **kwargs)
    300     if np.int_==np.int32 and dtype is None and is_listy(a) and len(a) and isinstance(a[0],int):
    301         dtype=np.int64
--> 302     return np.array(a, dtype=dtype, **kwargs)
    303 
    304 class EmptyLabel(ItemBase):

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (249,) + inhomogeneous part.

Class Modules

Hi,

Thank you for the well-written article and the structured artifacts provided. I am currently working on a similar project in software engineering. I want to use NoRBERT as part of my framework.
I am interested in task 1 but I can't manage to install/import all the dependencies. for the code cell below, I can't somehow import the necessary libraries so the names for NoRBERT classes are not defined. can you help me with that please?

Thanks in advance

#@title Load/Define NoRBERT classes {display-mode: "form"}
class FastAiBertTokenizer(BaseTokenizer):
    """Wrapper around BertTokenizer to be compatible with fast.ai"""
    def __init__(self, tokenizer: BertTokenizer, max_seq_len: int=512, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __call__(self, *args, **kwargs):
        return self

    def tokenizer(self, t:str):
        """Limits the maximum sequence length. Prepend with [CLS] and append [SEP]"""
        return ["[CLS]"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2] + ["[SEP]"]

## 

class BertTokenizeProcessor(TokenizeProcessor):
    """Special Tokenizer, where we remove sos/eos tokens since we add that ourselves in the tokenizer."""
    def __init__(self, tokenizer):
        super().__init__(tokenizer=tokenizer, include_bos=False, include_eos=False)

class BertNumericalizeProcessor(NumericalizeProcessor):
    """Use a custom vocabulary to match the original BERT model."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, vocab=Vocab(list(bert_tok.vocab.keys())), **kwargs)

def get_bert_processor(tokenizer:Tokenizer=None, vocab:Vocab=None):
    return [BertTokenizeProcessor(tokenizer=tokenizer),
            NumericalizeProcessor(vocab=vocab)]

class BertDataBunch(TextDataBunch):
    @classmethod
    def from_df(cls, path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:Optional[DataFrame]=None,
              tokenizer:Tokenizer=None, vocab:Vocab=None, classes:Collection[str]=None, text_cols:IntsOrStrs=1,
              label_cols:IntsOrStrs=0, **kwargs) -> DataBunch:
        "Create a `TextDataBunch` from DataFrames."
        p_kwargs, kwargs = split_kwargs_by_func(kwargs, get_bert_processor)
        # use our custom processors while taking tokenizer and vocab as kwargs
        processor = get_bert_processor(tokenizer=tokenizer, vocab=vocab, **p_kwargs)
        if classes is None and is_listy(label_cols) and len(label_cols) > 1: classes = label_cols
        src = ItemLists(path, TextList.from_df(train_df, path, cols=text_cols, processor=processor),
                      TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
        src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_df(cols=label_cols, classes=classes)
        if test_df is not None: src.add_test(TextList.from_df(test_df, path, cols=text_cols))
        return src.databunch(**kwargs)

##

class BertTextClassifier(BertPreTrainedModel):
    def __init__(self, model_name, num_labels):
        config = BertConfig.from_pretrained(model_name)
        super(BertTextClassifier, self).__init__(config)
        self.num_labels = num_labels
        
        self.bert = BertModel.from_pretrained(model_name, config=config)
        
        self.dropout = nn.Dropout(self.config.hidden_dropout_prob)
        self.classifier = nn.Linear(self.config.hidden_size, num_labels)

        #self.apply(self.init_weights)
    
    def forward(self, tokens, labels=None, position_ids=None, token_type_ids=None, attention_mask=None, head_mask=None):
        outputs = self.bert(tokens, position_ids=position_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, head_mask=head_mask)
        
        pooled_output = outputs[1]
        # According to documentation of pytorch-transformers, pooled output might not be the best 
        # and you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence 
        #hidden_states = outputs[0]
        #pooled_output = torch.mean(hidden_states, 1)

        dropout_output = self.dropout(pooled_output)
        logits = self.classifier(dropout_output)

        activation = nn.Softmax(dim=1)
        probs = activation(logits)   

        return logits

Unable to load Pickle file

Hi, thanks for uploading the ipython notebooks! I had a query -- I'm unable to load the pickle file, I'm getting the error : UnpicklingError: NEWOBJ class argument isn't a type object

tobhey / norbert Goto Github PK

norbert's Introduction

norbert's People

Contributors

Stargazers

Watchers

Forkers

norbert's Issues

Question regarding Dataset directory?

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (207,) + inhomogeneous part.

Class Modules

Unable to load Pickle file

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent