
Comments (11)

aarontp commented on May 25, 2024

I think there are a few options depending on what exactly you want to do. Do the strings change, or are they the same every time? Here are the options:

  • Add TextFile as an input evidence type in turbiniactl. There's not really a reason it's not already exposed; I just don't think it's come up as a use case yet. This is probably the easiest (a few lines of code), but probably not very efficient, as you'd have to submit a new request for every single file (or a bulk request with lots of files). I think the new API client will actually have this evidence type exposed by default.
  • Change the strings job to be able to process Directory or CompressedDirectory evidence types, and then the grep job would get the output from that. This wouldn't be too hard, as you'd just have to change the logic in the task code to check whether the source/local path is a directory and walk it if it is (a rough sketch follows at the end of this comment). It could probably be put into the same output file, but it would need different strings flags to print out the text file name.
  • If the strings are always the same, you could also put yara rules here: https://github.com/google/turbinia/tree/master/turbinia/config/rules and they will automatically get picked up by the yara job, though you might still have to change what is allowed as evidence input to the yara job, which is easy to do.
  • If you have other heuristics that you wanted to use to pre-filter some of the files you could also write a custom Task (https://turbinia.readthedocs.io/en/latest/developer/developing-new-tasks.html).

The second option is probably the best long term IMO. Let me know if that's something you might be interested in contributing towards and I'd be happy to help. Otherwise we can put in a feature request. Hope that helps and let me know if you have any questions about those options.
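
For option 2, a rough, untested sketch of the directory check and walk inside the task code could look like this (the helper name is made up, and the real task would still need to feed the collected files to strings):

import os

def _collect_files(local_path):
  """Return local_path itself for a single file, or every file below a directory."""
  if not os.path.isdir(local_path):
    return [local_path]
  collected = []
  for root, _, files in os.walk(local_path):
    collected.extend(os.path.join(root, name) for name in files)
  return collected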


aarontp commented on May 25, 2024

I think we can just keep this issue and rename it so that it has the context/history that might help with writing the feature. I'll rename it for now and we can go from there. When you are ready to take a look at an implementation, and after you've looked at the existing code a bit, maybe you can write up your ideas/questions here again just so we can give some feedback/help before you implement it. If you haven't already, I suggest reading through the developing-new-tasks documentation, even though you aren't creating a new task, as it will help give you some context for how the code is put together.


HolzmanoLagrene commented on May 25, 2024

Wow, thanks so much for your answer. I completely agree with you! I could contribute an implementation of option 2 in a couple of weeks; I will get back to this topic as soon as I have more time. Would it be best to close this issue and open a fresh one with the actual implementation idea?


HolzmanoLagrene commented on May 25, 2024

As mentioned above, the objective is to alter the strings job so that it can process Directory, CompressedDirectory and TextFile evidence. strings can easily be applied to TextFile evidence, and CompressedDirectory has to be extracted first anyway, so we are left with the problem of Directory evidence.
Thinking about this raises several fundamental questions that need to be answered first:

What should the output look like?

strings as it is used now prints the offset so that the location of each string can be determined. But how do we deal with that when multiple files are involved? I don't think this can be handled on the command line alone, as strings does not provide this functionality.

How should strings be used?

There are several options, at different levels, for applying the strings command to multiple files:

Option 1: Enumerate all files and pass them serially to strings on the command line:
 find evidence_path -type f | xargs strings &> result.out

or, to improve the speed a little bit:

 find evidence_path -type f | xargs -n 1 -P 8 strings &> result.out

Option 2: Enumerate all files in Python and pass them all to strings in a single command:
import glob
import os

# Enumerate every file below the evidence directory and build one strings command.
files = []
for file in glob.iglob(os.path.join(evidence.local_path, '**', '*'), recursive=True):
    if os.path.isfile(file):
        files.append(file)
cmd = 'strings -a -t d -e l {0:s} > {1:s}'.format(' '.join(files), 'result.out')
Option 3: Enumerate all files in Python and run the execute function once per file

This will create multiple result files, which could be nice if that is actually wanted (see above).

import glob
import os

from turbinia.evidence import TextFile

for file in glob.iglob(os.path.join(evidence.local_path, '**', '*'), recursive=True):
    if not os.path.isfile(file):
        continue
    base_name = os.path.basename(file)
    output_file_path = os.path.join(self.output_dir, '{0:s}.uni'.format(base_name))
    output_evidence = TextFile(source_path=output_file_path)
    cmd = 'strings -a -t d -e l {0:s} > {1:s}'.format(file, output_file_path)
    self.execute(cmd, result, new_evidence=[output_evidence], close=True, shell=True)
Option 4: Handle it with Turbinia

Maybe it's worth thinking about creating a TurbiniaTask for each file in a folder. But as I understand it, there is no "extract" job that takes a folder, creates TextFile evidence for every file, and feeds it back to Turbinia, so the handling of the folder would need to be done by the StringsJob itself. But maybe it is an option to extract and enumerate the directory in the job, create new evidence for each file, and create a new task for each of them.

This could look something like this:

tasks = []
for evidence_item in evidence:
    for file in glob.iglob(os.path.join(evidence_item.local_path, '**', '*'), recursive=True):
        if os.path.isfile(file):
            file_evidence = TextFile(source_path=file)
            # create one strings task per file evidence (exact wiring still unclear)
            tasks.append(StringsUnicodeTask())

This solution would let Turbinia handle the workload, but I'm not sure whether this is really possible.

How do we make sure strings performs decently?

Again this can be handled in different ways. I did a few tests by creating sample data:

mkdir sample_data
for i in {1..10}; do head -c 50M </dev/urandom >sample_data/test$i.data; done
time strings sample_data/* > ../result.out
Real Time: 0m5.495s
time find . -type f | xargs -n 1 -P 4 strings  > ../result.out
Real Time: 0m1.766s

I'm sure there are other, more advanced ways to speed up the strings command, but this illustrates that there is room for improvement.
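
As an illustration only (paths and worker count are placeholders), the xargs -P idea could also be expressed directly in Python with a thread pool, since the actual work happens in the child strings processes:

import concurrent.futures
import glob
import os
import subprocess

def run_strings(path):
  # Run strings on one file and return its output.
  return subprocess.run(
      ['strings', '-a', '-t', 'd', path],
      capture_output=True, text=True, check=False).stdout

files = [f for f in glob.glob(os.path.join('sample_data', '**', '*'), recursive=True)
         if os.path.isfile(f)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool, \
     open('result.out', 'w') as out:
  for text in pool.map(run_strings, files):
    out.write(text)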

What do you guys think is the way to go? Any thoughts?
PS: the Python scripts are for illustration purposes only :-) I'm not sure whether they actually work.


aarontp commented on May 25, 2024

Sorry I missed this comment while I was out -- I'll take a closer look at this and will respond tomorrow.


HolzmanoLagrene commented on May 25, 2024

No worries. I just had another idea that we could take into consideration:
I noticed that bulk_extractor, which handles files as well as disks very well, can take a pattern file with the -F flag. This could be another approach to my initial problem.


aarontp commented on May 25, 2024

Hello:

A few thoughts about your comments above:

  • Regarding "What should the output look like?": This might be a bit of a personal preference for people, but I think we could put everything in the same file with the filename prefix at the start of each strings line, similar to how the offset data is at the start of the lines for the existing strings output (see the sketch after this list).

  • As for how strings is used: I think it would be better to have the file enumeration code in python rather than a shell one-liner just to keep things as system/shell agnostic as possible and to minimize dependencies, plus I think it's a little cleaner that way. I don't think we should try to run strings against a single large list of the file names as we could run into shell or argv limitations with large directories that way. While it would be possible to create a new Task for each file, I think that would just add extra overhead and inflate the number of tasks too much so the output would be hard to read with turbiniactl status (we actually have a goal to combine some of the file extraction tasks into the analysis tasks that use the output from them as well).

  • Regarding performance: I don't have much input here, so if you have any experiments that show one way is better than another I'm pretty open to whatever works best :).

  • Bulk extractor may indeed be a better way to go here, and we do have a job for that already. Maybe @dfjxs who has worked with this quite a bit has some input here.
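
As a minimal, untested sketch of that filename-prefix idea (the function name and wiring are made up, not existing Turbinia code): run strings per file and write "path:offset string" lines into one combined output file:

import subprocess

def append_strings_output(path, output_fh):
  # Prefix every strings output line with the file path; each line already
  # starts with the decimal offset produced by -t d.
  proc = subprocess.run(
      ['strings', '-a', '-t', 'd', '-e', 'l', path],
      capture_output=True, text=True, check=False)
  for line in proc.stdout.splitlines():
    output_fh.write('{0:s}:{1:s}\n'.format(path, line))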

Hope that helps and sorry for the delay.


aarontp commented on May 25, 2024

Also, I'm not sure if you're already on the Slack channel or not, but there is a #Turbinia channel on the Open Source DFIR Slack, so feel free to ping me there if you want to chat directly, closer to real time.


aarontp commented on May 25, 2024

@dfjxs any thoughts on using BulkExtractor for Directory evidence types to get strings on all files? Any chance this will just work out of the box? :)


dfjxs commented on May 25, 2024

Yes, I think this may just work out of the box. Would just need to provide the right parameters to the Bulk Extractor task.


HolzmanoLagrene commented on May 25, 2024

OK, I finally found the time to work on this. Very sorry for the enormous delay. Here are my thoughts:

Input Types

For now, only the following types are enabled for the BulkExtractorJob:

evidence_input = [
    RawDisk, GoogleCloudDisk, GoogleCloudDiskRawEmbedded, EwfDisk
]

However, bulk_extractor is not only able to handle disks but basically any binary data blob, such as zips or text files, by default. If a directory is passed to it, though, the -R parameter has to be supplied.

As the job already allows almost any parameter to be handed to bulk_extractor via bulk_extractor_args, I suggest not adding more complexity for passing parameters but instead making the following two changes:

  • BulkExtractorJob: alter the input types to also allow Directory and CompressedDirectory.
  • BulkExtractorTask: change the command creation specified here so that it adds -R if a Directory is passed to bulk_extractor (see the sketch after this list):
    cmd = ['bulk_extractor']
    cmd.extend(['-o', output_file_path])
    if bulk_extractor_args:
      cmd.extend(bulk_extractor_args)
    cmd.append(evidence.local_path)
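
Put together, a rough, untested sketch of the two changes could look like this (assuming Directory and CompressedDirectory are imported from turbinia.evidence, and reusing the existing task code quoted above):

# BulkExtractorJob: also accept directory evidence.
evidence_input = [
    Directory, CompressedDirectory, RawDisk, GoogleCloudDisk,
    GoogleCloudDiskRawEmbedded, EwfDisk
]

# BulkExtractorTask: add -R when the evidence path is a directory.
cmd = ['bulk_extractor']
cmd.extend(['-o', output_file_path])
if os.path.isdir(evidence.local_path):
  cmd.append('-R')  # recurse into Directory / extracted CompressedDirectory evidence
if bulk_extractor_args:
  cmd.extend(bulk_extractor_args)
cmd.append(evidence.local_path)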

Parameters

As stated above, most configuration can be done with bulk_extractor_args. However, passing a regex pattern file would need its own parameter so that the task can actually create such a file and hand it to the binary. I would suggest handling that similarly to the GrepTask:

patterns = self.task_config.get('filter_patterns')
if not patterns:
  result.close(self, success=True, status='No patterns supplied, exit task')
  return result
patterns_file_path = write_list_to_temp_file(patterns)

Therefore, change the TASK_CONFIG of the BulkExtractorTask as follows:

  TASK_CONFIG = {
      'regex_patterns': [],
      'bulk_extractor_args': None
  }
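
A rough sketch of wiring the pattern file into the command, following the GrepTask snippet above (where exactly this fits into the existing command construction is an assumption on my side):

patterns = self.task_config.get('regex_patterns')
if patterns:
  # Write the configured patterns to a temporary file and hand it to bulk_extractor.
  patterns_file_path = write_list_to_temp_file(patterns)
  cmd.extend(['-F', patterns_file_path])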

What do you guys think of this idea?!

