WO 2010029346 20100318
IMPROVEMENTS IN OR RELATING TO DIGITAL FORENSICS
The present disclosure relates to improvements in or relating to digital forensics, and in particular to new methods and apparatus for digital forensic analysis of digital computing systems.
Locard's Exchange Principle best describes the fundamental theory of Forensic Science; that it is impossible to commit a crime without leaving a trace. Due to this interchange, it is possible for this evidence to be collected and analysed to establish the cause of an incident. The discipline of forensic science relates specifically to a scientific methodology to collect, preserve and analyse this information.
Forensic science has been developing ever since the late 18th Century, when the original forensic scientists practised medicine, usually performing autopsies in order to analyse the cause of disease. The use of forensic analysis for criminal investigations was derived from this medical background, due to the teaching of thorough, evidence-based reasoning, leading to the development of fingerprint evidence as well as contemporary specialisations such as ballistics, DNA analysis, toxicology, and so on.
Although digital computing devices and systems process and store virtual, not physical material, Locard's principle still applies. The discipline of digital forensics is well established, and is starting to diverge into specialties. The term "digital forensics" is understood herein to refer to investigative or analytical activity relating to any digital computing system ("DCS"), where a DCS is any device that manipulates, stores or otherwise processes digital information. For example, computers of all types, mobile telephones, personal digital assistants (PDA's), media players, set-top
boxes, games consoles, televisions, and all associated network components such as routers, switches, hubs, servers, and broadcast equipment, are encompassed by the term DCS. This list is not exhaustive and is provided for purposes of illustration only.
In the early days of digital forensics, the methodologies that were used in traditional forensics fields were directly applicable. The evidence gathered from computer systems was generally indicative of tangible crimes committed outside of the computer environment. Computers themselves were rarely the target for criminal activity, but instead were used to store evidence relating to crimes, such as fraud. However, with the increasing importance and popularity of computers, IT systems themselves are, increasingly, the target of criminal activity. It is now possible for crimes to be committed in virtual domains, with electronic and logical trails of evidence relating to events that only occurred in an abstract, or technical, sense. With the increasing sophistication of those individuals capable of breaking into a computer system (intruders), and the ubiquity of high-value data stored on digital systems, those tasked with investigating these crimes adopted investigative techniques identical to those they were tracking. Whereas the ballistics expert or fingerprint analyst possesses a skill set different from the individuals who committed the crimes, those working with networked systems have to learn and adopt skills identical to those of the individuals being investigated.
Internetworked systems are not the only technology being abused by criminals, or implicated in criminal activities. PDA's, phones and digital cameras can all contain vital clues as to the circumstances surrounding a crime. With the continued convergence of technologies, the number and variation of devices will merge and expand into new and even more sophisticated equipment. For example, the convergence of the digital
camera and mobile phone leads to interesting trails of evidence. These changes and re-definitions of a computing device occur frequently, and rapidly. The almost limitless configuration of devices capable of storing and processing digital information results in an ecosystem of variant devices, with differing levels of openness, methods of interaction, and so on. This poses a significant challenge for the computer forensic investigator, who must keep their skill-set updated and relevant. This has been addressed, in part, by the development of sub-disciplines within this domain. These can be referred to as Computer Forensics, PDA Forensics, Network Forensics, and so on. The term "Digital Forensics" (DF) encompasses all of these particular sub-disciplines. While various bodies have attempted to offer formal definitions of the field of digital forensics, the term as used in this description has a slightly broader meaning as discussed above, because, as will be apparent, the teaching of this disclosure can be applied to any investigative or analytical activity relating to any DCS.
The problems facing DF do not solely relate to the variance within DCS. Different domains require distinctive outcomes from a DF investigation. Broadly, these domains can be defined as: civilian; and organisational. The civilian context normally involves law enforcement agencies investigating individual(s) with the intent of solving or prosecuting an alleged crime. The distinction between these two domains is the manner in which crime is detected, reported, and controlled. Within the civilian context, crimes are reported to, and therefore investigated by, law enforcement agencies. Within the organisational context, a crime can constitute a number of different types of transgression, all of which have differing levels of severity. A circumvention of an IT usage policy would be a transgression against the organisation. Depending on the policy circumvention, it may also break various laws. If this is the case, the
organisation may seek prosecution, which means the necessary law enforcement groups would be contacted. Yet the organisation may choose to deal with the policy circumvention itself, as it may have the necessary technical and operational expertise to conduct such an investigation in-house, or possibly because of the sensitive nature of the operations or incident that occurred.
A DF investigation may be carried out in a variety of circumstances. Temporal aspects of the investigation can be the determining factor. The traditional view is of the post-event analysis of a computer-related crime. This may be the analysis of records left on a computer system that has been used in connection with a criminal offence, for example the examination of a computer system that contains documents relating to a drugs offence. With the rise of more technologically sophisticated crimes, it may be the case that digital forensic techniques are deployed during the time in which a crime is being committed. As a consequence, the tools and procedures used during an investigation differ, along with the reasons for launching an investigation and the desired outcome. For example, the primary objective of a law enforcement operative is to successfully prosecute the criminal, and the forensic activity is usually carried out after the event of a crime; the primary objective of a military outfit would be to maintain continuity of operations, and the forensic activity may be carried out during the performance of a crime.
The methodology employed to investigate an incident will depend on a number of factors, some of which have just been highlighted. One of the principal concerns of the security domain is with intrusion detection, that is, prevention mechanisms and detection techniques for preventing unauthorised access to data. The security domain classically employs a model of policy circumvention which can broadly be defined as: 1.
Reconnaissance; 2. Foot-printing; 3. Enumeration; 4. Probing for weaknesses; 5. Penetration; 6. Gaining the objective; 7. Clean-up.
However, within the DF domain, unlike that of security, these steps do not offer the entire framework. The methodological context and investigative model for a security domain are not always applicable to the DF domain.
The following example illustrates how, in particular, policy context influences how different systems would view an event where users A and B are users on a computer network. A logs in from a given node, N, to run the program EXAMPLE to access B's customer record.
Detection of Attack: Is the program EXAMPLE an attack tool? Are the actions taken by the user A part of an Attack?
Detection of Intrusion: Is the user A really logged in? Does A normally log in from node N at this time? Does A normally run program EXAMPLE to access B's customer record? Is A allowed to do so?
Detection of Misuse: Is A supposed to be running program EXAMPLE? Is A supposed to be accessing B's customer record? Has A run other programs that make running EXAMPLE a policy violation? Has A accessed other records that make accessing B's records a policy violation?
Analysis: What is happening on the system? Where is node N? Who is using the account called "A" and where is that user located? What program is being called by the name EXAMPLE? What part of the database is being accessed by the label "B's customer
record" and where is it being stored? What changes happen as a result of this action?
Thus, it can be seen that the goal of an investigation is coloured by a particular context. The same is true for DF.
Accordingly, the investigative model for DF differs in some respects from that of the security domain. The DF investigative model borrows from fundamental forensic science models, but also incorporates domain specific knowledge.
The manner in which an investigation proceeds is normally modelled, or predefined, in order to ensure the investigator fulfils core investigative procedures, as well as the different considerations that must be made in each set of circumstances. Context should be considered, and can be important at various different stages of the investigation. As we will see, there are a number of different factors that must be considered, as they will have a fundamental impact on the investigation and its outcomes. A good model of investigation can provide a basis for common technology, aid implementation and testing of new technology, and provide a common basis for technology and information sharing. The fundamental methodology used in digital forensics is the same as used in forensic science, comprising the following stages:
Acquisition - Obtain the data from the device(s) under investigation. The quantity of information gathered will vary depending on the size of disk analysed or the size of distributed system under investigation.
Analysis - Construct a hypothesis about the events leading up to and after the incident. Use the evidence collected from the environment in order to confirm or refute the original hypothesis. Construct a new hypothesis as necessary and reiterate the process.
Presentation - The findings of the investigation will be written up and presented in a manner that is easy to understand, explaining, with reference to the evidence collected, the conclusions. All abstracted terminology should be explained in detail.
The evidence collected during an investigation must be handled in an appropriate manner, as it is this material that will be used to prove, or disprove a hypothesis constructed and can be inculpatory or exculpatory.
Due to the ad-hoc nature in which digital forensics has evolved, a wide array of guidelines has been published by law enforcement agencies, governments, system administrators and even system intruders. There are various inconsistencies between, and deficiencies in, these models. However, the stages of a DF investigation will generally include the following:
1. Preparation - This stage involves implementing and establishing proper audit and controls in order to detect an incident. This will be defined in a policy, agreed upon by management, which takes into consideration legal aspects and the goals of the organisation in question. In addition, this phase consists of collecting the appropriate tools, establishing techniques and training personnel to use these tools and techniques.
2. Incident Response - This stage consists of identifying an incident from the auditing systems, or any other indications available. This phase will consist of establishing the extent of the intrusion, and preparing a response according to the goals of the organisation.
3. Data Collection - At this stage, data should be preserved, if possible, from contamination. All necessary information from both the physical and logical crime scene should be recorded using standardised and accepted procedures.
4. Data Analysis - Construct a hypothesis about the events leading up to and after the incident. Use the evidence collected from the environment in order to confirm or refute the original hypothesis. Construct a new hypothesis as necessary and reiterate the process.
5. Presentation of Findings - Findings of the investigation will be written up and presented in a manner that is easy to understand, explaining, with reference to the evidence collected, the conclusions. All abstracted terminology should be explained in detail.
6. Incident Closure - Depending on the findings or intent of the investigation, criminal proceedings may be initiated, disciplinary hearings may be conducted, or a review of IT policy may be undertaken.
Another key aspect of a DF investigation is the nature of the data itself. Every DCS creates, stores or manipulates digital information, which forms
the basis of digital evidence. DCS's create a diverse range of data other than that familiar to an everyday, unskilled end user. For every text document created and saved to a hard disk, or for every data packet routed from one end of the Internet to the other, a voluminous amount of data relating to each activity is created, manipulated and discarded. Some of this information is useful, and can be used in a variety of ways, from debugging an application to signalling that various equipment or applications are working in a correct manner. Indeed, some of this record keeping by digital computing systems is desired, as it allows the operators to gather situational awareness, which generally comes in the form of a log file or audit event. Low-level system events, as well as application-specific events, all generate records by the use of various event logging mechanisms provided by the relevant Operating System (OS), or in a bespoke manner by the application itself. This situational awareness can provide the system administrator with enough information to understand when a particular user has logged in, or for the software engineer to have an indication of the last error message a piece of software generated.
All data produced by a system is regarded by the DF investigator as possible evidence. The terms "data" and "digital evidence" will henceforth be used interchangeably. The potential richness of this pool of data can be limited, or even diluted, by the type of data that gets logged, and by which particular software system logs it. Depending on context and intent, the scarcity or over-abundance of data can be beneficial, detrimental, or both. Because the digital evidence used for an investigation can originate from many different sources, and not just the output of a security monitoring system, we will define the general term Logging Entity (LE), which will be used to cover all forms of digital collection and logging apparatus.
Data gathered has a number of characteristics which need to be taken into account when designing a DF methodology and system.
Firstly, data can be viewed at different levels of abstraction. This is also known as the complexity problem. At its lowest form, data is generally incomprehensible to humans, as it is a series of ones and zeros. It takes a great deal of skill to view data in this manner, and although not impossible, it is not an efficient or desirable form of analysis. The operating system or application will generally translate this form of data into a human-readable format.
One of the best examples of the complexity problem can be outlined using HTML. HTML is a mark-up language that defines the layout and look of a Web page. At its lowest representation, it is a collection of ones and zeros. When opened in an HTML editor, it appears as a series of tags.
Although not unreadable, it is difficult for the human analyst to make sense of this information as, for example, images will merely be represented by links to image locations on a storage medium, and layout will be represented by encoding that is not intuitive. When HTML is opened using a Web browser, it will be interpreted and rendered into a form that is readable by humans, including all images, text and appropriate layout.
Data also requires interpretation (this is also part of the complexity problem). An analyst requires tools or software to render data in a human-readable format. This introduces a number of problems relating to the translation of data from one layer of abstraction to another. If the tool being used mistranslates the data, such an error may misrepresent the original meaning or the reality the data conveys.
Data is also characteristically fragile. It is easy to alter as it is highly malleable, thus it is prone to change by accident or through intent. This is why there is a great deal of focus on how data is handled by the investigator, ensuring provenance throughout all stages of an investigation.
Due to the fragility of digital evidence, practice guidelines need to be adhered to in order to help maintain the integrity of data collected and to maintain a coherent and verifiable chain of evidence. Considerations include steps to ensure that no action taken by a law officer changes data on a DCS under investigation unless they have the relevant competencies; if any changes are made, an audit trail or record must be created and preserved, with the person with overall responsibility for the investigation also being specifically responsible for ensuring that these principles are adhered to.
A further characteristic of data is the sheer volume that is generated, which raises issues of storage and filtering for effective analysis. With a large increase in data production per person, and much of this data being stored on magnetic media (hard disks being the most popular form), more and more information is being produced. Combined with a rapid increase in internet users, the amount of data being transmitted, and therefore subject to existing auditing and logging infrastructure, will grow. Hard disk size grows year-on-year, making the proposition of analysing every single file on a single PC, never mind those belonging to a large organisation, a daunting task which is expensive in terms of time and resources.
Data is also difficult to associate with reality. Some of the most difficult aspects of attempting a prosecution of an individual have related to proving that the individual was operating the machine at the time the
alleged crime took place. Along with the ease of computer automation, and the problems that malware causes, it can be difficult to attribute blame easily to an individual. For this purpose it is the use of physical (cameras and door entry logs) and digital (computer usage logs, file traces) artefacts in combination that provides the most compelling evidence, although improved digital methods in themselves would be useful.
Data collected also needs to be evaluated against the relevant fitness criteria. For example to be suitable for use as evidence in a court of law, the data must be shown to be complete, accurate and authentic.
Accordingly there is a need for improved DF techniques that can address one or more of these problems with data and that is preferably compatible with accepted principles of DF methodology and/or can cope with different investigative contexts.
According to a first aspect of the present invention there is provided a digital forensic analysis method comprising the steps of: collecting system call data from a digital computing system (DCS); converting the system call data to a sequence format; selecting from a system call sequence database a test sequence of system calls; and performing a sequence matching step to detect matches between the test sequence of system calls and the system call data collected from the DCS.
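In outline, the four steps of this method can be sketched as follows. This is a hedged, minimal sketch only: the system call names, the letter encoding and the exact-substring matcher are hypothetical placeholders, and a practical implementation would use a more tolerant sequence matching algorithm of the kind discussed below.

```python
# Hypothetical sketch of the claimed pipeline: collect -> encode -> match.
# The syscall names and the encoding table are illustrative assumptions.

SYSCALL_TO_LETTER = {"NtOpenFile": "A", "NtReadFile": "B",
                     "NtWriteFile": "C", "NtClose": "D"}

def to_sequence(syscalls):
    """Convert a list of system call names to a letter string."""
    return "".join(SYSCALL_TO_LETTER.get(name, "X") for name in syscalls)

def find_matches(collected, test_sequence):
    """Report every offset where the test sequence occurs verbatim.
    (A real implementation would use a pattern matching or sequence
    alignment algorithm tolerant of gaps and substitutions.)"""
    hits, start = [], collected.find(test_sequence)
    while start != -1:
        hits.append(start)
        start = collected.find(test_sequence, start + 1)
    return hits

collected = to_sequence(["NtOpenFile", "NtReadFile", "NtWriteFile",
                         "NtClose", "NtOpenFile", "NtReadFile", "NtClose"])
print(find_matches(collected, "AB"))  # offsets where open-then-read occurs
```
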
A "sequence matching step" is performed by a sequence matching algorithm, which can be any pattern matching, pattern recognition, or sequence alignment algorithm. A "sequence format" is any data format suitable for use with a sequence matching algorithm.
Optionally, the sequence format represents one system call as a sequence element and the sequence is a string of the sequence elements. The sequence elements are optionally alphanumeric characters, preferably letters of the Roman alphabet.
Optionally, the sequence matching step comprises the use of a biological sequence matching algorithm.
A "biological sequence matching algorithm" is taken to be any sequence matching algorithm that is known for use in the fields of protein or DNA sequence matching. The term also extends to modified versions or adaptations of these biological sequence matching algorithms.
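As one concrete illustration, the Smith-Waterman local alignment method, a well-known algorithm from DNA and protein sequence matching, can be applied directly to system call strings. Its use here, and the scoring parameters, are illustrative assumptions, not requirements of the disclosure.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score between two syscall letter strings.
    Scoring parameters are illustrative placeholders."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment floors each cell at zero, so a good match
            # can start anywhere within the collected data.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

# A fingerprint "ABBD" embedded in unrelated activity still scores highly:
print(smith_waterman("XXABBDXX", "ABBD"))  # 8 with these parameters
```

A local (rather than global) method suits this setting because the fingerprint of interest typically occupies only a small region of a long collected trace.
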
Optionally, said biological sequence matching algorithm uses Karlin-Altschul statistics as a basis for determining sequence alignment.
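Karlin-Altschul statistics relate an alignment score S to the number of alignments of at least that score expected by chance, E = K·m·n·e^(−λS), where m and n are the lengths of the two sequences and K and λ are parameters derived from the scoring scheme. A minimal sketch follows; the parameter values are placeholders, not taken from the disclosure.

```python
import math

def expect_value(score, m, n, K=0.1, lam=0.5):
    """Karlin-Altschul expectation: the number of local alignments with
    score >= `score` expected by chance between random sequences of
    lengths m and n. K and lam (lambda) depend on the scoring scheme;
    the defaults here are illustrative placeholders."""
    return K * m * n * math.exp(-lam * score)

# Higher scores are exponentially less likely to arise by chance:
print(expect_value(10, 1000, 50) < expect_value(5, 1000, 50))  # True
```
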
Optionally, the method further comprises frequency domain analysis of system call data. This analysis can be carried out either before or after the step of converting the system call data to a sequence format.
Optionally, the frequency domain analysis comprises assigning a value as an amplitude to each system call.
Optionally, the frequency domain analysis comprises constructing a signal on the basis of the number of occurrences of individual alphabet characters per unit time in the sequence of system calls.
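One way such a signal might be constructed is sketched below: count the occurrences of one alphabet character per fixed-size window of the sequence (treating each element as taking unit time, which is an assumption on our part), then examine the result in the frequency domain. The naive transform here stands in for whatever frequency analysis an implementation would actually use.

```python
import cmath

def letter_signal(sequence, letter, window):
    """Occurrences of one syscall letter per fixed-size window of the
    sequence; each sequence element is assumed to take unit time."""
    return [sequence[i:i + window].count(letter)
            for i in range(0, len(sequence), window)]

def dft_magnitudes(signal):
    """Naive discrete Fourier transform magnitudes of the signal."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]

sig = letter_signal("AABA" * 4, "A", window=4)  # perfectly periodic activity
print(sig)                      # one count per window
print(dft_magnitudes(sig)[0])   # DC component equals the total count
```
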
Optionally, an entry in the system call sequence database is constructed by:
running a test scenario on the DCS and collecting system call data generated by the test scenario; converting said system call data to a sequence format; and recording a sequence of system calls as a database entry corresponding to the test scenario.
The "sequence of system calls" thus recorded is referred to as a "fingerprint" in the following description.
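The construction of a fingerprint entry can be sketched as below. The collection function is a stand-in for the real collection apparatus, and the syscall names and encoding are hypothetical.

```python
# Hypothetical construction of a fingerprint database entry: run a known
# test scenario, capture its system calls, encode them, store the sequence.

SYSCALL_TO_LETTER = {"NtCreateFile": "A", "NtWriteFile": "B", "NtClose": "C"}

def collect_system_calls(scenario_name):
    """Stand-in for the collection apparatus: pretend the named test
    scenario produced this fixed trace."""
    return ["NtCreateFile", "NtWriteFile", "NtWriteFile", "NtClose"]

def record_fingerprint(database, scenario_name):
    """Run a scenario, encode its trace, and store one database entry."""
    trace = collect_system_calls(scenario_name)
    sequence = "".join(SYSCALL_TO_LETTER.get(s, "X") for s in trace)
    database[scenario_name] = sequence  # one fingerprint per test scenario
    return sequence

db = {}
print(record_fingerprint(db, "save_document"))  # ABBC
print(db)
```

Since the claimed database may be unique to a given DCS, one such store would be built per machine under this sketch.
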
Optionally, said database is unique to a given DCS.
Optionally, said system call data is collected at the interface between kernel and user space of the DCS, most preferably by a software wrapper that intercepts all system calls made between said kernel and user space.
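On Linux, the strace(1) utility observes system calls at exactly this kernel/user-space boundary, which makes it a convenient analogue of the software wrapper described here (the drawings suggest a Windows NT target, so this is an analogy, not the disclosed implementation). A hedged sketch of extracting call names from strace-style output:

```python
import re

# Matches the leading system call name in an strace output line,
# e.g. 'openat(AT_FDCWD, "notes.txt", O_RDONLY) = 3'.
STRACE_LINE = re.compile(r"^(\w+)\(")

def parse_strace_output(lines):
    """Extract the system call name from each strace-style output line,
    skipping lines (signals, exit notices) that are not calls."""
    calls = []
    for line in lines:
        m = STRACE_LINE.match(line)
        if m:
            calls.append(m.group(1))
    return calls

sample = [
    'openat(AT_FDCWD, "notes.txt", O_RDONLY) = 3',
    'read(3, "hello", 5) = 5',
    'close(3) = 0',
]
print(parse_strace_output(sample))  # ['openat', 'read', 'close']
```

The resulting name list is what the data formatting means would then convert to the letter-sequence format.
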
Optionally, the method further comprises the step of emulating a user interface, and using the sequence of system calls to recreate graphically user actions on an emulated graphical user interface.
Optionally, said emulator reads ahead of a displayed time slice and compiles the data, and then subsequently displays the simulated user experience in the order as input by a user of the DCS.
Optionally, one or more of the steps carried out following collection of data from a device under test (DUT) are carried out at a location remote from the DUT and/or remote from a DUT host organisation.
According to a second aspect of the present invention there is provided a digital forensic system comprising:
data collection means for collecting system call data from a digital computing system (DCS); data formatting means arranged to convert the collected system call data to a sequence format; and sequence matching means arranged to detect a match between said collected system call data and a test sequence of system calls.
Optionally, said system comprises an evidence generation means arranged to run a test scenario on the DCS, collect system call data generated by said test scenario, convert the collected system call data to a sequence format, and record said system call data in a system call sequence database.
Optionally, the means by which the evidence generation means converts the collected system call data to a sequence format can be the same data formatting means as used to convert the system call data from the DCS.
Optionally, the data collection means is provided at the interface between kernel and user space of the DCS, most preferably being provided as a software wrapper that is arranged to intercept all system calls made between said kernel and user space.
Optionally, the system comprises a performance monitoring means which is arranged to operate the evidence generation means.
Optionally, said sequence matching means comprises biological sequence matching means.
Optionally, said biological sequence matching means comprises a component that uses Karlin-Altschul statistics as a basis for determining sequence alignment.
Optionally, said system further comprises an emulation means arranged to simulate a graphical user interface, to receive an input sequence of system call data and to display a graphical recreation of user activity based on said input sequence of system call data.
Optionally, said emulation means further comprises a compiler arranged to compile a portion of the sequence of system call data that is ahead of a displayed time slice; and processor means to interpret the compiled data and to display the simulated user experience in the order as input by a user of the DCS.
Optionally, components of the digital forensic system are distributed, with at least said data formatting means and said sequence matching means being at a location remote from the DUT and/or remote from a DUT host organisation.
Optionally, the components are distributed in a service-oriented architecture (SOA).
According to a third aspect of the present invention there is provided a system call sequence database comprising as entries system call sequences generated from running test scenarios on a DCS.
According to further aspects there are provided computer programs and computer program products for implementation of the preceding aspects,
which can be recorded on a computer storage medium including as a signal, or made available for download or other types of transmission.
The present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Fig. 1 illustrates the layers of an x86 architecture Operating System;
Fig. 2 illustrates Windows NT code execution space and associated data structures;
Fig. 3 illustrates a matrix initialisation phase of a global sequence alignment method;
Fig. 4 illustrates a matrix fill phase of a global sequence alignment method;
Fig. 5 illustrates a matrix trace back phase of a global sequence alignment method;
Fig. 6 illustrates a matrix for a local sequence alignment method;
Fig. 7 illustrates a general system for digital forensics;
Fig. 8 illustrates the collection of system call data in the system of Fig. 7;
Fig. 9 illustrates the operation of an evidence generation means in the system of Fig. 7;
Fig. 9A illustrates an example implementation of a digital forensics system, according to a framework in which components of the system are distributed;
Fig. 10 illustrates a digital forensic method;
Fig. 11 illustrates an embodiment of a playback technique;
Fig. 12 illustrates an alternative embodiment of a system for digital forensics;
Fig. 13 illustrates the operation of an evidence generation means in the system of Fig. 12;
Fig. 14 illustrates an example entity-relationship diagram showing an example structure of a database for storing system call data collected from a data collection means;
Fig. 15 illustrates an extract of system call data collected for an example file manipulation scenario;
Fig. 16 illustrates the comparison of total pairs added to a database as compared with the number of unique sequences discovered by a sequence discovery tool;
Fig. 17 illustrates an example match section from a sequence matching algorithm match report;
Fig. 18 illustrates a search space for demonstrating the alignment between sequences. Each point on the diagram represents a pairing of letters from each sequence;
Fig. 19 illustrates a statistical representation of system call activity for a DUT, in a scenario where no user activity is occurring;
Fig. 20 illustrates a statistical representation of system call activity for a DUT, in a scenario where user activity is occurring;
Fig. 21 illustrates a key of the system calls shown in Figs. 19 and 20;
Fig. 22 illustrates the differences in empirical analysis between a time-based proximity method that indicates pattern scoring and a generation-based substitution method that indicates biological-style evolution; and
Figs. 23-37 illustrate match results taken from match reports of various experiments demonstrating the efficacy of a sequence matching technique for various use case scenarios.
The text of the drawings forms part of the disclosure and is incorporated herein by reference.
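The initialisation, fill and trace back phases shown in Figs. 3 to 5 are the three phases of the classical Needleman-Wunsch global alignment procedure; a minimal sketch of those phases follows, with illustrative scoring parameters (the disclosure does not fix a particular algorithm or scoring scheme).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment in the three phases of Figs. 3-5:
    matrix initialisation, matrix fill, and trace back."""
    rows, cols = len(a) + 1, len(b) + 1
    # Phase 1: initialisation -- first row and column are cumulative gaps.
    F = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        F[i][0] = i * gap
    for j in range(cols):
        F[0][j] = j * gap
    # Phase 2: fill -- each cell takes the best of the three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i-1] == b[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s, F[i-1][j] + gap, F[i][j-1] + gap)
    # Phase 3: trace back -- recover one optimal alignment.
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        s = match if i and j and a[i-1] == b[j-1] else mismatch
        if i and j and F[i][j] == F[i-1][j-1] + s:
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i and F[i][j] == F[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return ("".join(reversed(out_a)), "".join(reversed(out_b)),
            F[len(a)][len(b)])

# Aligning two syscall letter strings inserts a gap for the unmatched call:
print(needleman_wunsch("ABCD", "ABD"))  # ('ABCD', 'AB-D', 2)
```
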
Data found on a system is either associated with an automated process (syslog, event logs, Intrusion Detection System (IDS), etc.) or associated with the intentional actions of a human (a saved file, a viewed picture, a moved document, and so on). The data collected by a DF process can be categorised as being either raw or interpreted. Interpreted data poses a problem as it involves the translation of data as it passes through a number of different layers before ending up in its final state. The nature of
the translations and even in some cases the layers themselves are often undocumented or out of the control of the DF investigator.
Most of the data collected by a DF investigation can be considered to be of the translated variety, as it resides in logs or in files on the hard disk. These types of data have passed through at least one layer of software, and have therefore been subject to manipulation. The practitioner may not have all of the information available to them in order to make an accurate judgment as to whether the data has been tainted by passing through a layer of logic.
The use of interpreted or raw data also has implications for the DF researcher. It is their objective to assess the fitness of the data, broadly against the criteria of Completeness, Accuracy, and Authenticity. Raw data has the benefit of allowing both the practitioner and the researcher to assess the validity of the data collected. Yet there is a drawback: raw data is generally collected at the lower levels of the DCS, where there is a greater chance of a voluminous amount of information to gather and make sense of. For example, the Network Intrusion Detection System (NIDS) SNORT collects data from the lower layers of the OSI stack. The data it collects relates to all the layers of this stack; while it is useful to see the data in this raw form, before it has passed through the many layers of translation in the OSI stack, there is an increased complexity and volume of data associated with this form of raw data. Yet raw data provides the potential for rich analysis, if it can be interpreted. There is thus a growing sense within the DF community that raw sources of data are of interest, and therefore an important area of research.
The logging entities that produce the more traditional, interpreted, sources of data, exist at all layers of the network abstraction. For example, from an end-to-end perspective, we can classify contemporary digital evidence collection as pertaining to the host, or the network. When considering a host, we include devices such as PDAs, laptops, servers and so on.
Network devices are also capable of being configured to report on their own operation, as well as that of the information passing through the device.
On the host, the layer at which the data is produced is also important. Contemporary logging entities producing event logs and syslog files are often tied into the operating system (OS) itself. For example, the syslog protocol is a logging facility provided on UNIX and Linux systems. This service can be used to report any form of event, from kernel-level events to those produced by processes working within user space. For example, the Apache Web server can be configured to use the syslog service for reporting activities. The syslog service can be regarded as open, and the operation and implementation of this service is documented. In contrast, the event log service provided by the Microsoft Windows NT family of OSs can be considered closed, as Microsoft have never published details of how it works. The Windows NT Event Log behaves in a similar fashion to syslog, as it provides a logging interface for bespoke software, along with the operating system, to use, but the internals are an unknown. This highlights the problems facing DF investigators: there is an obfuscation of logic and origin when using events generated by the Windows Event Log service. The Apache Web server reports errors to the Windows Event Log, which acts as a back-up when Apache cannot access its own file-based logging services. It would be possible to independently verify whether the actions of the server were those being reported to the event log, because Apache is decoupled from the operating system. Yet, when a DF investigator is faced with event logs which pertain to operations of the OS itself, it is not as easy to decouple events from actions.
Finally, the intentional output of data on a system by a human operator, in the form of a file or document, is one of the main sources of data available to the DF investigator. Existing solutions aim to provide faster, easier ways to analyse data and discover evidence within ever-growing volumes of material.
The Intrusion Detection System (IDS) is the de facto Logging Entity used both in research and in practice to provide a basis for study or situational awareness for networked systems. An IDS dynamically monitors the actions taken in a given environment and decides whether these actions are symptomatic of an attack or constitute a legitimate use of the environment. Current IDS design, implementation and research relates specifically to the security field, and is limited to the detection of attacks, intrusions and misuse for the purpose of intrusion detection. A passive system will record policy circumvention events to a log, awaiting a human analyst's response. An active system will perform an action, such as attempting to block the circumvention in some manner, by blocking network access, or by stopping a user from accessing a file on a hard disk. The data can be collected either in real time or at set periodic intervals.
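The passive mode of operation described above can be illustrated with a minimal sketch; the rule names, event fields and event format used here are purely hypothetical and not part of any real IDS:

```python
# Minimal sketch of a passive IDS: each event is matched against a rule
# set, and policy circumvention events are appended to a log for later
# human analysis. No blocking action is taken (that would be "active").
RULES = {
    "repeated_login_failure": lambda e: e["type"] == "login" and e["failures"] >= 3,
    "raw_disk_access":        lambda e: e["type"] == "open" and e["target"].startswith("\\\\.\\"),
}

def passive_ids(events, log):
    """Record events matching any rule to the log; take no other action."""
    for event in events:
        for name, predicate in RULES.items():
            if predicate(event):
                log.append((name, event))
    return log

log = passive_ids(
    [{"type": "login", "failures": 5, "target": ""},
     {"type": "open", "failures": 0, "target": "C:\\temp\\a.txt"}],
    [],
)
```

An active system would replace the `log.append` step with an enforcement action, such as closing a network connection or denying file access.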
The post-mortem collection and analysis of a DCS for a DF investigation will generally occur within the bounds of a retrospective decision-making process. Ideally, the detection of an event at a particular time will also provide for the collection of data pertaining to events occurring both before and after the detected event, as there is a requirement for inculpatory and exculpatory evidence. The detection should include at least those events immediately preceding and following the detected event, and optionally a plurality of events that precede and/or follow the detected event. This, in effect, changes not only the manner in which the detection function operates, but also the manner in which data is collected, as a priori decisions must be made as to the fitness of data for an investigation that may take place in the future, or not at all. There will be a need to collect suitable data such that the occurrence of an event can be detected and also verified according to the data fitness requirements of Completeness, Authenticity and Accuracy.
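The retention of events immediately preceding and following a detection can be sketched with a simple ring buffer; the window sizes and the use of integers to stand in for logged events are illustrative assumptions:

```python
from collections import deque

def capture_context(events, is_suspicious, pre=2, post=2):
    """For each detected event, retain the events immediately preceding
    and following it, providing inculpatory and exculpatory context."""
    history = deque(maxlen=pre)      # ring buffer of the most recent events
    captures = []
    for i, event in enumerate(events):
        if is_suspicious(event):
            captures.append({
                "event": event,
                "before": list(history),                # events preceding detection
                "after": events[i + 1 : i + 1 + post],  # events following detection
            })
        history.append(event)
    return captures

# Hypothetical event stream; event 5 is treated as the detected event.
context = capture_context(list(range(10)), lambda e: e == 5)
```

The same structure applies whether collection is real-time or periodic: the pre-event buffer must already be populated when the detection fires, which is why the a priori decision about what to collect matters.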
For example, the main focus of contemporary IDS research is the manner in which a signature set can be built, and how quickly and effectively this signature set, or anomaly detection algorithm, can be harnessed to provide the real-time or near real-time analysis required. The issues of false positives and false negatives are of great concern to this type of research. Yet the DF community is interested in event reconstruction and verification activities. This means the traditional security methods and considerations are not as relevant to the DF domain.
The use of unaltered, raw data from a system can provide both an attack vector and a research vector. In particular, back-door and/or debugging techniques provide a very powerful method of gaining an insight into how an operating system works, and how security controls can be subverted. By monitoring operating system paths of operation, it is possible for an attacker to gain an insight into the way the system works, and thus subvert its normal mode of operation. Similarly, the researcher does not need to rely on interpreted sources of data in order to gain an insight into how to solve a certain problem.
In the OS security field, there are various techniques for lower-level data analysis and collection activities, for both research and malicious purposes, including memory analysis and low-level disk analysis methods. One technique in particular has been used within the debugging and rootkit domains for some time. This technique, called library interposition, allows an individual to monitor, subvert, or otherwise change the normal operation of the system.
Library interposition involves the use of a software wrapper which is placed around, or hooks, the data structure of interest. Any subsequent calls to this data structure are intercepted by this wrapper.
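Library interposition proper is an OS-level technique (for example, DLL hooking on Windows or LD_PRELOAD on Linux), but the wrapper concept can be illustrated in Python by hooking a library function at runtime; the logging format here is an illustrative assumption:

```python
import os

call_log = []

def interpose(module, name, log):
    """Place a wrapper around module.name: every subsequent call is
    recorded, then forwarded to the original implementation."""
    original = getattr(module, name)
    def wrapper(*args, **kwargs):
        log.append((name, args))           # monitor the intercepted call
        return original(*args, **kwargs)   # forward to the real function
    setattr(module, name, wrapper)
    return original                        # keep a handle for restoration

original_getcwd = interpose(os, "getcwd", call_log)
cwd = os.getcwd()                          # now routed through the wrapper
setattr(os, "getcwd", original_getcwd)     # unhook, restoring normal operation
```

The caller is unaware of the wrapper, which is precisely the property that makes the technique useful both for transparent monitoring and for malicious subversion.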
The DF method of this disclosure involves the collection of system call data, preferably using a software wrapper. The details of the technique differ depending on the OS. Two example embodiments, a Windows-based OS and a Linux-based OS, will be discussed herein. However, it will be apparent to those skilled in the art that the collection of system call data can be achieved for any given OS, including any OS based on the x86 CPU architecture (such as both Windows and Linux), and OSs based on other CPU architectures. Further, the collection of system call data can be combined accordingly with the other aspects described in the disclosure, making necessary modifications to those other aspects in order to be compatible with the specific OS in question.
A system call is defined in this disclosure as being any request made between the user space and the kernel space.
In a first embodiment, data is collected by hooking Windows NT system calls. The Windows NT (New Technology) kernel, in some form or another, has provided the basis for Microsoft's flagship Windows operating systems, from NT 3.1 (released in 1993), through NT 4.0, Windows 2000 (NT 5.0) and Windows XP (NT 5.1) and, to a degree, Windows Vista.
Each of these releases has been an at least 32-bit, x86-based OS. x86-based OSs, such as Windows NT and Linux, operate on fairly similar principles of security and privilege separation. The operating system is divided into a number of different layers (or protection levels) for security purposes, as shown in Fig. 1. Code execution and security are defined by privilege and the ability to execute in each of these layers. Code that requires low-level, higher-security access executes within the kernel space (Level 0 in Fig. 1), and everything else executes within user space (Levels 1, 2 and 3 in Fig. 1). The reasons for this include security, performance, and the separation of authenticated code. One of the main purposes of this separation is the provisioning of services by the operating system to the applications that run in user space.
At the hardware level, x86-based processors do not recognise these different layers of the operating system, and instead view all executing code in terms of an eight-byte data structure called the Segment Descriptor. It is this Segment Descriptor that maps the OS layers to different hardware runtime levels. The Segment Descriptor contains the start address of the code segment, the length of the segment, and the privilege level at which the code will run. The privilege level is something that the processor does recognise, and it will run code accordingly. These privilege levels are divided into different levels of protection, or rings. These rings range from the highest privilege rating of Ring 0 through to Ring 3, as shown in Fig. 1. The inner rings are used for highly critical software modules, such as the kernel, and the outer rings are used for less critical software modules, or anything that runs in user space. Windows NT only uses rings 0 and 3, due to cross-platform and compatibility considerations. Code which executes at a lower privilege level cannot directly call, or access, code which runs at a higher privilege level. If this is attempted, a general protection exception is generated by the CPU. The only way low-privileged modules can access higher-privileged code is via a protected and highly controlled interface called a gate.
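The eight-byte Segment Descriptor layout just described can be sketched as a simple decoder; the field extraction follows the standard x86 descriptor format, and the example bytes below are fabricated for illustration only:

```python
import struct

def parse_segment_descriptor(raw: bytes):
    """Decode an eight-byte x86 segment descriptor into its segment base
    address, segment limit, and descriptor privilege level (DPL, the ring)."""
    limit_lo, base_lo, base_mid, access, flags_limit_hi, base_hi = \
        struct.unpack("<HHBBBB", raw)
    base = base_lo | (base_mid << 16) | (base_hi << 24)       # start address
    limit = limit_lo | ((flags_limit_hi & 0x0F) << 16)        # segment length
    dpl = (access >> 5) & 0x3           # bits 5-6 of the access byte: Ring 0-3
    return {"base": base, "limit": limit, "dpl": dpl}

# A hypothetical flat user-mode code segment: base 0, limit 0xFFFFF,
# DPL 3, i.e. code permitted to run only in Ring 3 (user space).
desc = parse_segment_descriptor(
    bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0xFA, 0xCF, 0x00]))
```

The DPL field is what the processor consults when deciding whether a call crosses into a more privileged ring, triggering the general protection exception described above.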
Windows NT provides a gate to the kernel for code executing in user space to access underlying services. These gates provide core OS services, such as file access, device access, thread handling, registry operations, and so on. Fig. 2 shows an outline of the manner in which this gate can be implemented, and the data structures and libraries involved in this process. For example, if an application running in user space were to request a file write operation, it would call the function WriteFile(), which is present within the Kernel32.dll library. This library exports a list of functions which are subtly different to those in NtDll.dll, and the parameters and entry points also differ. Once the program has made the call, the parameters are marshalled, and the functions within the NtDll.dll data structure are called. NtDll.dll then calls NtosKrnl.exe, which contains a data structure called KiSystemService(). This function then looks up a data structure, or table, called KiServiceTable. Within this table is a description of all the exported functions available to user space, and the parameters that require passing. Further data structures related to the KiServiceTable describe the length of the parameters that must be passed to each system call. It is at this point that the gate mechanism used can differ: options range from the interrupt gate INT 2eH, to the specialised SYSENTER feature on certain x86 CPUs, to the SYSCALL instruction. No matter the gate mechanism, the data table and the result of a system call are the same. The result of a call to the KiSystemService() function then causes a hardware interrupt, which causes the system call to be processed. As mentioned above, it is this interaction between user space code modules and the kernel that we define as a System Call. Calls between applications within user space are defined as Function Calls.
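The System Call versus Function Call distinction can be illustrated in Python on a POSIX system, where the `os` module exposes thin wrappers over kernel services, while ordinary string operations never leave user space; this is a sketch, not part of the disclosed method itself:

```python
import os

# System Calls: each of these requests crosses the boundary from
# user space into kernel space.
read_end, write_end = os.pipe()      # pipe() is a system call
os.write(write_end, b"evidence")     # write() is a system call
data = os.read(read_end, 8)          # read() is a system call
os.close(read_end)
os.close(write_end)

# Function Call: the string is assembled entirely within user space,
# with no transition into the kernel.
label = "".join(["sys", "call"])
```

It is the first category, the boundary crossings, that the software wrapper of this disclosure is designed to capture.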
Each release of the OS has a differing number of system calls; in Windows 2000, 248 of these start with the prefix Nt. System calls prefixed with these symbols describe the base services offered by Windows NT. Within the kernel, the prefix of these system calls is generally interchangeable with the prefix Zw, which represents the manner in which they are called within the kernel. (Zw-formatted system calls reside within the ntdll.dll library. The Zw and Nt prefixes merely signify which library the system calls are called from, and a different type of access check. They both access the same, fundamental, services.) From the example given previously, within user space WriteFile() would appear without any prefix, NtDll.dll would marshal it as NtWriteFile(), and from within kernel space, if it were called, it would be called as ZwWriteFile(). In spite of this naming difference, they both share the same memory location, and perform the same basic service task. The data structures and system calls described here form the basis of the Windows NT Native API. This API is not documented by Microsoft, and the number, location, and descriptions of system calls can change with every kernel modification. These modifications can happen with each major OS revision, or with the release of a service pack. However, enough information has been made public to guide the implementation of the DF methods described herein.
The next major aspect of this disclosure relates to the use of sequence matching techniques, which are well established in the bioinformatics field. Sequence Alignment and Multiple Sequence Alignment are well known and studied techniques for quantifying and matching gapped or noncontiguous string sequences. Sequence Alignment (SA) is used to great effect within the field of bioinformatics for protein and DNA sequence
matching analysis. Proteins differ from DNA chemically and functionally. DNA is used to store information, and is built up from an alphabet of four nucleotides. Proteins fulfil various functions, as they form the basis of the structures and mechanisms within cells. The functional characteristics of proteins are derived from their ability to fold into specific three-dimensional shapes. For example, a protein that folds itself into a stiff rod may be used for structural support. These functional aspects are dictated by the sequence of amino acids that forms the protein. In total, there are twenty amino acids within the protein alphabet. This alphabet comprises single-letter descriptions of the amino acids. The full alphabet is presented in Appendix A.
For example, using the one-letter amino acid symbols, a protein sequence is expressed simply as a string of these letters.
These sequence strings, and the similarity between them, are what biologists search for when performing SA. The similarity (homology) between proteins can give an indication to the evolutionary distance between proteins. This is a powerful technique, as a protein's phylogenetic (physical) structure is expressed solely through these sequence arrangements.
The SA problem can be defined as part of the longest common subsequence (lcs) problem. Formally, this can be described as follows: given two strings A = a1a2...am and B = b1b2...bn, where m ≤ n, over an alphabet Σ of size s, the lcs problem is to find the best alignment between A and B of the longest length by introducing gaps. The length of the lcs is the measure of similarity between the two sequences. This problem is NP-hard for an arbitrary number of sequences, but the two-sequence case is solvable using dynamic programming techniques. There are two general ways in which to solve this problem: local alignment or global alignment.
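The two-sequence dynamic programming solution can be sketched as follows; the example strings are chosen purely for illustration:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, computed
    with the standard O(m*n) dynamic programming table."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # extend the subsequence ending at a[i-1] == b[j-1]
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # drop a character from one string or the other
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

length = lcs_length("STRAINED", "BRAIN")   # common subsequence "RAIN"
```

The filled table is also the basis of alignment: tracing back through it recovers which characters were matched and where gaps were introduced.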
The Needleman-Wunsch global alignment algorithm is an efficient, dynamic programming method for aligning two sequences. This method uses a two-dimensional matrix to perform an alignment between the two sequences. For example, consider the strings STRAINED and BRAIN. We can try to align these two strings with a simple scoring system which gives +1 for matching letters, -1 for mismatches, and -1 for gaps. The alignment will look something like the following:
STRAINED        STRAINED
-BRAIN--        B-RAIN--
Both of the above alignments give the same score under the scoring system described. The scoring matrix assigns a score to the fitness of each match, and a pointer to the most likely match. By following these pointers, a match is formed, and a score is given to the fitness of the match. The algorithm has an initialisation phase, a fill phase, and a trace-back phase.
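Under the scoring scheme described (+1 match, -1 mismatch, -1 gap), the initialisation and fill phases can be sketched as follows; only the score computation is shown, with the trace-back pointers omitted for brevity:

```python
def needleman_wunsch_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Global alignment score via the Needleman-Wunsch matrix fill."""
    m, n = len(a), len(b)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # initialisation phase:
        f[i][0] = i * gap                # first column holds cumulative gaps
    for j in range(1, n + 1):
        f[0][j] = j * gap                # first row holds cumulative gaps
    for i in range(1, m + 1):            # fill phase
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          f[i - 1][j] + gap,     # gap in b
                          f[i][j - 1] + gap)     # gap in a
    return f[m][n]

score = needleman_wunsch_score("STRAINED", "BRAIN")
```

For STRAINED against BRAIN this yields a score of 0, consistent with the two equally scored hand alignments shown above (four matches against one mismatch and three gaps).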
In the initialisation phase, the first column and row of the matrix are filled with values, as shown in Fig. 3. Each cell in the first row and column is filled with the gap score (-1) multiplied by its distance from the origin. The arrows are t