norbert's Introduction

🔭 Postdoc at Karlsruhe Institute of Technology (KIT)

I do research on Natural Language Processing (NLP) for Software Engineering (SE), with a focus on traceability link recovery and requirements engineering.

In my dissertation, I developed FTLR, an automated traceability link recovery approach that relates requirements to their corresponding source code entities using fine-grained, word embedding-based relations. I also developed the requirements classification approach NoRBERT and integrated its results into FTLR as a filter.

Previously, I’ve done research on programming in natural language, mapping natural language instructions to their corresponding API calls.

norbert's People

Contributors

gram21, norbert-one, norbert-two, tobhey

norbert's Issues

Question regarding Dataset directory?

Hi @tobhey,

I read the paper, great job!
This is not an issue with the models or the code provided in the project.
I have a question about the dataset directory provided in this project.
Does it contain the whole dataset used for the project, or only the split used for testing the models?

Thanks in advance!

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (207,) + inhomogeneous part.

I can't understand why training the classifier suddenly fails in Task2_3_Multiclass_classification_of_NFR_subclasses.ipynb. If there is a solution, please share it. I have attached a screenshot of the error and included the cell that raised it.

Cell: Decide how to fold and train the classifier
Code snippet:

overall_flat_predictions, overall_flat_true_labels, results = [], [], []
initLog()
if config.fold == Fold.TenFold:
  skf = StratifiedKFold(n_splits=10)
  fold_number = 1
  for train, test in skf.split(df, df[config_data.label_column]):
    df_train = df.iloc[train]
    df_eval = df.iloc[test]
    log_text = '/////////////////////// Fold: {} of {} /////////////////////////////'.format(fold_number,10)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
    fold_number = fold_number + 1
elif config.fold == Fold.ProjFold:     
  for k in config_data.project_fold:
    test = df.loc[df['ProjectID'].isin(k)].index
    train = df.loc[~df['ProjectID'].isin(k)].index
    df_train = df.loc[train]
    df_eval = df.loc[test]
    log_text = '/////////////////////// Test-Projects: {} /////////////////////////////'.format(k)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
else:
  df_train, df_eval = train_test_split(df,stratify=df[config_data.label_column], train_size=config.train_size, random_state= config.seed)
  classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)

get_memory_usage_str() 

Error:

Train Dataframe shape: (332, 18)
Evaluation Dataframe shape: (37, 18)
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-b2bf3dae31b4> in <cell line: 29>()
     35     log_text = '/////////////////////// Fold: {} of {} /////////////////////////////'.format(fold_number,10)
     36     logLine(log_text)
---> 37     classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
     38     fold_number = fold_number + 1
     39 elif config.fold == Fold.ProjFold:

10 frames
/usr/local/lib/python3.10/dist-packages/fastai/core.py in array(a, dtype, **kwargs)
    300     if np.int_==np.int32 and dtype is None and is_listy(a) and len(a) and isinstance(a[0],int):
    301         dtype=np.int64
--> 302     return np.array(a, dtype=dtype, **kwargs)
    303 
    304 class EmptyLabel(ItemBase):

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (249,) + inhomogeneous part.
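
The message in this traceback is the one NumPy raises when it is asked to build a regular array from sequences of different lengths. A plausible cause, which is an assumption and not confirmed in this thread, is that variable-length token id lists reach np.array inside fastai, and recent NumPy releases reject such input instead of producing an object array. The sketch below uses only the standard library and hypothetical helper names (is_ragged, pad_to_longest) to show how one might check for that condition and what a rectangular, padded version looks like.

```python
from typing import List, Sequence


def is_ragged(rows: Sequence[Sequence[int]]) -> bool:
    """Return True if the rows do not all have the same length."""
    return len({len(row) for row in rows}) > 1


def pad_to_longest(rows: Sequence[Sequence[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every row with pad_value so all rows share one length."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


# Hypothetical token id sequences of different lengths.
token_ids = [[101, 2023, 102], [101, 102], [101, 2023, 2003, 102]]

print(is_ragged(token_ids))  # True
padded = pad_to_longest(token_ids)
print(is_ragged(padded))     # False
```

If the check reports ragged rows, commonly tried options are padding the sequences or pinning NumPy to a release that matches the environment the notebook was written for. Which of these applies here is not established by the issue.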

Class Modules

Hi,

Thank you for the well-written article and the structured artifacts provided. I am currently working on a similar project in software engineering. I want to use NoRBERT as part of my framework.
I am interested in Task 1, but I can't manage to install and import all the dependencies. In the code cell below, the necessary libraries are not imported, so the names used by the NoRBERT classes are undefined. Could you help me with that, please?

Thanks in advance

#@title Load/Define NoRBERT classes {display-mode: "form"}
class FastAiBertTokenizer(BaseTokenizer):
    """Wrapper around BertTokenizer to be compatible with fast.ai"""
    def __init__(self, tokenizer: BertTokenizer, max_seq_len: int=512, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __call__(self, *args, **kwargs):
        return self

    def tokenizer(self, t:str):
        """Limits the maximum sequence length. Prepend with [CLS] and append [SEP]"""
        return ["[CLS]"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2] + ["[SEP]"]

## 

class BertTokenizeProcessor(TokenizeProcessor):
    """Special Tokenizer, where we remove sos/eos tokens since we add that ourselves in the tokenizer."""
    def __init__(self, tokenizer):
        super().__init__(tokenizer=tokenizer, include_bos=False, include_eos=False)

class BertNumericalizeProcessor(NumericalizeProcessor):
    """Use a custom vocabulary to match the original BERT model."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, vocab=Vocab(list(bert_tok.vocab.keys())), **kwargs)

def get_bert_processor(tokenizer:Tokenizer=None, vocab:Vocab=None):
    return [BertTokenizeProcessor(tokenizer=tokenizer),
            NumericalizeProcessor(vocab=vocab)]

class BertDataBunch(TextDataBunch):
    @classmethod
    def from_df(cls, path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:Optional[DataFrame]=None,
              tokenizer:Tokenizer=None, vocab:Vocab=None, classes:Collection[str]=None, text_cols:IntsOrStrs=1,
              label_cols:IntsOrStrs=0, **kwargs) -> DataBunch:
        "Create a `TextDataBunch` from DataFrames."
        p_kwargs, kwargs = split_kwargs_by_func(kwargs, get_bert_processor)
        # use our custom processors while taking tokenizer and vocab as kwargs
        processor = get_bert_processor(tokenizer=tokenizer, vocab=vocab, **p_kwargs)
        if classes is None and is_listy(label_cols) and len(label_cols) > 1: classes = label_cols
        src = ItemLists(path, TextList.from_df(train_df, path, cols=text_cols, processor=processor),
                      TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
        src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_df(cols=label_cols, classes=classes)
        if test_df is not None: src.add_test(TextList.from_df(test_df, path, cols=text_cols))
        return src.databunch(**kwargs)

##

class BertTextClassifier(BertPreTrainedModel):
    def __init__(self, model_name, num_labels):
        config = BertConfig.from_pretrained(model_name)
        super(BertTextClassifier, self).__init__(config)
        self.num_labels = num_labels
        
        self.bert = BertModel.from_pretrained(model_name, config=config)
        
        self.dropout = nn.Dropout(self.config.hidden_dropout_prob)
        self.classifier = nn.Linear(self.config.hidden_size, num_labels)

        #self.apply(self.init_weights)
    
    def forward(self, tokens, labels=None, position_ids=None, token_type_ids=None, attention_mask=None, head_mask=None):
        outputs = self.bert(tokens, position_ids=position_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, head_mask=head_mask)
        
        pooled_output = outputs[1]
        # According to documentation of pytorch-transformers, pooled output might not be the best 
        # and you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence 
        #hidden_states = outputs[0]
        #pooled_output = torch.mean(hidden_states, 1)

        dropout_output = self.dropout(pooled_output)
        logits = self.classifier(dropout_output)

        activation = nn.Softmax(dim=1)
        probs = activation(logits)   

        return logits
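
The undefined names in this cell appear to come from two libraries: the fast.ai v1 text module (for example BaseTokenizer, TokenizeProcessor, TextDataBunch) and a Hugging Face BERT package (BertTokenizer, BertModel, BertConfig, BertPreTrainedModel). The exact import lines and versions used by the notebook are not shown in this thread, so treat that as an assumption. Independently of those imports, the truncation logic in FastAiBertTokenizer.tokenizer can be illustrated with plain Python. The sketch below substitutes a hypothetical whitespace tokenizer for BertTokenizer.tokenize, so its tokens differ from real WordPiece output.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for BertTokenizer.tokenize (not WordPiece)."""
    return text.lower().split()


def wrap_for_bert(text: str,
                  tokenize: Callable[[str], List[str]] = whitespace_tokenize,
                  max_seq_len: int = 512) -> List[str]:
    """Truncate to max_seq_len - 2 tokens, then add [CLS] and [SEP]."""
    tokens = tokenize(text)[:max_seq_len - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]


example = wrap_for_bert("The system shall respond within two seconds", max_seq_len=6)
print(example)
# ['[CLS]', 'the', 'system', 'shall', 'respond', '[SEP]']
```

Reserving two positions before truncating keeps the total length, including the [CLS] and [SEP] markers, within max_seq_len.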

Unable to load Pickle file

Hi, thanks for uploading the IPython notebooks! I have a question: I'm unable to load the pickle file. I get the following error: UnpicklingError: NEWOBJ class argument isn't a type object
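
The report does not say which Python or library versions wrote and read the file, so the cause cannot be determined from it. One thing that can be checked without unpickling anything is which pickle protocol a file was written with. The following sketch uses the standard library's pickletools for that; the helper name pickle_protocol and the sample data are hypothetical.

```python
import pickle
import pickletools
from typing import Optional


def pickle_protocol(data: bytes) -> Optional[int]:
    """Return the protocol number declared in a pickle, or None if absent.

    Protocols 0 and 1 do not write a PROTO opcode, so None is returned
    for them. Nothing is unpickled; the opcodes are only inspected.
    """
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "PROTO":
            return arg
        break  # PROTO, when present, is always the first opcode
    return None


sample = pickle.dumps({"label": "NFR"}, protocol=2)
print(pickle_protocol(sample))  # 2
```

For a file on disk, the bytes can be read with open(path, "rb") and passed to the helper in the same way.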
