Note: This article is an English translation of an earlier post. If any part is hard to read, let me know and I will improve the translation.

Principle: Static program analysis is a code analysis technique that examines program code without running it. It uses lexical analysis, parsing, control-flow analysis, data-flow analysis and similar techniques to check whether the code meets requirements for standards compliance, security, reliability and maintainability. The same approach can be used to judge whether a piece of software is a virus, generally from the following aspects:

String Checking

Without running the program, tools are used to extract the strings it contains and look for suspicious information that helps decide whether it is a virus. The principle and analysis are as follows:

Definition: A string is a sequence of characters such as digits, letters and underscores. It is mainly used in programming, concept descriptions, function explanations, etc.

Additional knowledge: a string is stored much like an array of characters, so each individual element can be extracted.

Commonly used for: output messages, URLs, file names, path information, etc.

As computers can only work with the binary digits 0 and 1, encoding is used to represent the characters of a string.

Definition: Encoding is the process of converting information from one form or format to another. A pre-defined method is used to convert text, numbers or other objects into numeric codes, or to convert information or data into a defined electrical pulse signal. Encoding is widely used in computers, television, remote control and communications. Decoding is the reverse process of encoding.

Common encodings:

ASCII: A character encoding based on the Latin alphabet, mainly used to represent modern English and other Western European languages.
Unicode: An industry standard in computing that includes a character set and encoding schemes. Unicode was created to overcome the limitations of traditional character encodings: it assigns a single, unique code to every character in every language, so that text can be converted and processed across languages and platforms.
GB 2312: Used for information exchange in Chinese character processing and Chinese communication systems. It is commonly used in mainland China and also in Singapore and other places. Almost all Chinese systems and internationalised software in mainland China support GB 2312.
GBK: The GBK encoding standard is compatible with GB 2312. It contains 21,003 Chinese characters and 883 symbols, provides 1,894 user-defined code points, and combines simplified and traditional characters in a single character set.

To extract strings from a binary file, you can use the following tool:

Strings

Official website: https://docs.microsoft.com/zh-cn/sysinternals/downloads/strings

Function: Finds printable strings in object files or binary files

Drawback: it ignores context and file format, so some of what it reports is not a real string at all but a memory address, a sequence of CPU instructions, a piece of data, etc. whose bytes happen to be printable.

Limitations: by default it only finds runs of three or more consecutive printable ASCII characters (terminated by 0x00) or Unicode characters (terminated by 0x0000).

Target: Computer viruses can exploit this restriction to stop Strings from finding useful strings (e.g. by breaking every string into two-character pieces and stitching them together at run time).

Tip: Change the file extension of the sample before searching, so that it cannot be run by accident.
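The search described above can be sketched in a few lines of Python. This is only a rough approximation of what a strings tool does, not the Sysinternals implementation: the function name `extract_strings` and the `min_len` parameter are made up for illustration, and the sketch does not require a terminator after each run.

```python
import re

def extract_strings(data, min_len=3):
    """Return printable ASCII and UTF-16LE runs of at least min_len characters."""
    printable = rb"[\x20-\x7e]"
    # ASCII: a run of printable bytes.
    ascii_pat = re.compile(printable + rb"{%d,}" % min_len)
    # UTF-16LE: each printable byte is followed by a zero byte.
    wide_pat = re.compile(rb"(?:" + printable + rb"\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_pat.finditer(data)]
    return found

# Hypothetical sample bytes: one ASCII string and one UTF-16LE string.
sample = b"\x01\x02cmd.exe\x00\x03" + "kernel32".encode("utf-16-le") + b"\x00\x00"
print(extract_strings(sample))
```

The same function also shows why the evasion trick works: a URL stored as two-character pieces, such as `b"ht\x00tp\x00:/\x00/e\x00"`, yields no result because no piece reaches the minimum length.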

Case in point:

Most file-infecting viruses infect PE files, because this lets the virus code run whenever the PE file runs, so the virus can go on to infect other normal files and spread. From an anti-virus point of view, you should therefore first determine whether a file has a PE structure and then decide which method to use to scan it. So how do we determine whether a file is a PE file? Let's start with the concept of PE:

  • PE concept: PE (Portable Executable) is the generic term for executable files under Windows; common examples are DLL, EXE, OCX and SYS files.
  • Scope: Windows executable programs and dynamic link libraries.
  • Contains: the information Windows needs to load the file from the hard disk into memory and execute it.

Whether a file is a PE file has nothing to do with its extension: a PE file can have any extension. So how does Windows distinguish executable from non-executable files? When we call LoadLibrary and pass a file name, how does the system decide that the file is a legitimate dynamic library? This is where the PE file structure comes in.

The PE structure:

This specification describes the structure of executable (image) files and object files under the Windows family of operating systems. These files are referred to as Portable Executable (PE) and Common Object File Format (COFF) files, respectively.

https://learn.microsoft.com/en-us/windows/win32/debug/pe-format
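As a minimal sketch of the "is this file a PE file?" check: according to the specification linked above, an image file starts with an MS-DOS stub whose field at offset 0x3C holds the file offset of the 4-byte signature "PE\0\0". The function name `is_pe_file` is hypothetical, and the check only looks at these two markers, not at the rest of the header.

```python
import struct

def is_pe_file(data):
    # A PE image begins with the DOS magic "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 32-bit field at offset 0x3C gives the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"
```

Because the check reads the content rather than the name, a renamed sample (for example one whose extension was changed to keep it from running) is still recognised.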

Commonly used PE analysis tools

PE-bear: https://github.com/hasherezade/pe-bear

PEview(roguekillerpe): https://www.adlice.com/download/roguekillerpe/

PPEE: https://mzrst.com/

CFF Explorer: https://ntcore.com/?page_id=388

PE Explorer: http://www.heaventools.com/overview.htm

Detection and removal techniques:

Determining if a program entry point in a PE file is abnormal

After infecting a PE file, many viruses add a portion of code to it and then change AddressOfEntryPoint in the PE header to point to the inserted code, so that whenever the file is run, the virus code runs first.

In general, many viruses place the inserted code at the end of the PE file and finish it with a jump back to the real entry point of the original PE file, so the user runs the virus code without noticing. Anti-virus software can therefore judge whether a file may be infected by checking whether its entry point is abnormal. If the entry point does not fall where it normally would (for example, it points into the appended code at the end of the file), the file is suspected of being infected. Of course, this judgement is not always accurate, but it can serve as one basis for a decision. The heuristic scan mentioned in the previous article uses features like this to help identify unknown viruses.

To defeat this kind of detection, some viruses have found ways to change the program flow without modifying the entry point, for example by patching the code at the original entry point so that it jumps to the virus body.
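The entry-point check can be sketched as follows. This is an illustrative heuristic, not how any particular anti-virus product works: it reads AddressOfEntryPoint and the section table using the field offsets given in the PE format specification, and flags the file when the entry point lies in the last section or in no section at all. The function names and the "last section" rule are assumptions made for this sketch, and a truncated header will raise `struct.error`.

```python
import struct

def entry_point_section(data):
    """Return (index, section_count, name) of the section holding the entry point, or None."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file")
    (pe,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe:pe + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    (num_sections,) = struct.unpack_from("<H", data, pe + 6)   # COFF header
    (opt_size,) = struct.unpack_from("<H", data, pe + 20)      # SizeOfOptionalHeader
    (entry_rva,) = struct.unpack_from("<I", data, pe + 24 + 16)  # AddressOfEntryPoint
    table = pe + 24 + opt_size  # section table follows the optional header
    for i in range(num_sections):
        off = table + 40 * i  # each section header is 40 bytes
        name = data[off:off + 8].rstrip(b"\x00").decode("ascii", "replace")
        vsize, vaddr, raw_size = struct.unpack_from("<III", data, off + 8)
        if vaddr <= entry_rva < vaddr + max(vsize, raw_size):
            return i, num_sections, name
    return None

def entry_point_looks_suspicious(data):
    hit = entry_point_section(data)
    if hit is None:
        return True  # entry point is outside every section
    index, count, _name = hit
    # Assumed rule: code appended by a virus usually lives in the last section.
    return count > 1 and index == count - 1
```

A positive result is only a hint. Some legitimate packers also move the entry point to a late section, and viruses that patch the original entry code instead of changing AddressOfEntryPoint will pass this check.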

Extracting signatures (feature codes) based on PE structure

The simple way to extract signatures is to divide the file into several parts and take a fixed length of content from each part. The problem with this method is that many files share similar content. With the PE files discussed here, for example, a large part of the beginning is the same in many files, so dividing the file into equal parts is not ideal. Instead, we can use the PE structure: take a certain amount of content from each section as signatures, or use key points in the structure as references and look for signatures near them. This largely avoids the drawbacks of the equal-division method and makes the signatures of different viruses more distinct. In the CIH detection below, for example, the features examined are near the PE header and near the entry point.

Demo:

Identification of CIH virus

There are three characteristics:

The first is that if the first byte of the PE Header is non-zero, the file is likely to be infected; CIH itself uses this marker to check whether a file is already infected.

However, this feature is not always reliable, because this byte may also be non-zero for various reasons in programs that are not infected with CIH, so two additional code features are added.

CIH changes the code entry point to point to itself. Based on this, we take signatures at the entry point offset: the sidt instruction, and the two operations after it that install the file system hook. This makes the identification more reliable.

Of course, all three features are concentrated in the virus header. To be more reliable still and to avoid misidentifying variants within the family, we can also add signatures from code further into the virus body.

Linked libraries and functions

Linked libraries and functions provide a great deal of useful information for virus analysis. Why and how do computer viruses use them? With this question in mind, read on:

Why viruses use them: through the import table in the PE structure, a virus brings the link libraries and functions it needs into memory, then calls the functions in those dynamic link libraries to do its work.

  • Introductory question: what is linking, and what linking methods are there? Linking solves the problem of combining our own code with a library written by someone else.
  • Static linking: the least commonly used method of linking libraries on Windows platforms, although it is more common in UNIX and Linux programs.
  1. What: The binary code for all required functions is included in the executable file when it is generated (link time). The linker therefore needs to know which functions each participating object file requires and which functions each object file provides, so that it can check that every required function can be linked correctly. If a required function is not found in any participating object file, the linker reports an error. Two structures in the object file provide this information: the symbol table and the relocation table. When a library is statically linked into an executable, all the code from that library is copied into the executable.
  2. Advantage: no library dependencies at release time, i.e. no libraries have to be shipped with the application, which can be executed on its own.
  3. Disadvantages: the PE file header contains no information about the linked library. The executable is larger and takes up more memory, and if the static library is updated, every executable must be re-linked to use the new version. Computer viruses do not normally use this method, because they try to keep their size small.
  4. Link time: when the executable is generated (linking is done as part of the build)
  • Dynamic linking: the most common method, and the one malicious code analysts should care about most. Dynamic linking information is written in the import table, and when a library is dynamically linked, the host operating system searches for the required library when the program is loaded.
  1. Features: Instead of copying the library's executable code into the program at build time, the program records a series of symbols and parameters, which are passed to the operating system when the program is loaded or run. The operating system loads the required dynamic libraries into memory, and when the program reaches the relevant code, it executes the library code already loaded in memory. Linking is thus completed at run time.
  2. Advantage: multiple programs can share the same piece of code without storing multiple copies on disk.
  3. Disadvantage: because libraries are loaded when the program starts, start-up performance may be affected.
  4. Link time: when the program is loaded or run
  • Run-time linking: When the application calls the LoadLibrary or LoadLibraryEx function, the system tries to locate the DLL using the load-time dynamic linking search order (see Load-time dynamic linking). If it is found, the system maps the DLL module into the process's virtual address space and increments the reference count. If the DLL named in the call is already mapped into the virtual address space of the calling process, the function only returns a handle to the DLL and increments its reference count. Note: Two DLLs with the same file name and extension but in different directories are not considered to be the same DLL. Note: Although run-time linking is not popular in legitimate programs, it is commonly used in malicious code, especially when the code is packed or obfuscated. Packing or obfuscation destroys the virus's import table, and without an import table Windows will not do the linking for the virus, so the virus has to use run-time linking to load the libraries and functions it needs into memory.
  1. Features: link only when needed
  2. Advantage: an executable that uses run-time linking links to a library only when a function is needed, rather than at program start-up as in dynamic linking
  3. Disadvantage: the program has to call the relevant functions itself (such as LoadLibrary and GetProcAddress) to use the library
  4. Link time: when a function call is encountered
  • Link-based analysis: The PE file header lists all dynamic link libraries and functions required by the virus code. Library and function names, together with information on commonly used dynamic link libraries, can be used to analyse what a computer virus does.
  • Commonly used analytical tools:
    Dependency Walker: included in some versions of Visual Studio and other Microsoft development packages; it lists the dynamically linked functions of an executable file

  • Common functions in viruses:

  1. LoadLibrary: loads a dynamic link library from the hard disk into the memory space of the virus process at run time
  2. GetProcAddress: finds the address of a given function in a loaded DLL
  3. URLDownloadToFile(): downloads a file from the Internet

Import functions: The PE file header also contains information about the specific functions used by the executable. The import table only shows the name of each function; to understand its parameters, purpose and usage, look it up in Microsoft's MSDN documentation or with a search engine.

Export functions: Similar to import functions, the export functions of DLLs and EXEs are used to interact with other programs and code. Usually a DLL implements one or more functions and exports them so that other programs can import and use them. The PE file also contains information about which functions a file exports.
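As a very rough illustration of name-based analysis, the sketch below searches the raw bytes of a file for the three API names listed above. It is a shortcut chosen for brevity and does not parse the import table, so it also matches names that merely appear as data, and it finds nothing in a packed sample whose names are hidden. The `SUSPICIOUS_APIS` table and the function name are hypothetical.

```python
# Hypothetical watch list built from the functions named in this article.
SUSPICIOUS_APIS = {
    "LoadLibrary": "loads a DLL at run time",
    "GetProcAddress": "looks up a function address in a loaded DLL",
    "URLDownloadToFile": "downloads a file from the Internet",
}

def find_suspicious_apis(data):
    """Return the watched API names that occur as ASCII text in data."""
    hits = {}
    for name, meaning in SUSPICIOUS_APIS.items():
        # Substring match, so LoadLibraryA / LoadLibraryW are caught as well.
        if name.encode("ascii") in data:
            hits[name] = meaning
    return hits
```

Seeing LoadLibrary and GetProcAddress together with very few other imports is a common hint that the sample resolves its real imports at run time.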

    Auxiliary detection and removal

Anti-virus software, malware scanning platforms and malware analysis platforms are commonly used to assist detection and removal. They have the following advantages:

Having a virus signature database: a database containing the distinctive "likenesses" of known viruses. Software that matches these characteristics can be identified as a virus. This mainly works for known viruses.

Virus countermeasure: virus writers can easily modify their code to change these characteristics, often using the following techniques to avoid detection by anti-virus software:

Polymorphic techniques: the semantics stay the same while the syntax is obfuscated, making reverse analysis harder.
Metamorphic (morphing) techniques: the function stays the same while the semantics are obfuscated, making reverse analysis harder.
One-way execution techniques: rely on values that cannot be recovered by guessing, such as hashes, making reverse analysis harder.
Junk instructions: insert a large number of instructions that are useless to the analysis, making reverse analysis harder.
  • Having heuristic rules: some viruses have no signature in the database, so signature matching cannot find them. Heuristic rules, summed up from experience in analysing known viruses, are used to judge whether a piece of software is a virus. They mainly target unknown viruses. Virus countermeasure: develop new types of virus whose characteristics and behaviour are not yet known to anti-virus software.

When restrictions apply, for example there is no local anti-virus software or bandwidth is limited, you can calculate the hash value of the file and look it up on a website instead. The principle and common query platforms are as follows:

Principle: A hash function calculates a fixed-length value from the content of a file. In practice this value serves as a unique identifier: different content gives a different hash, while metadata such as the file name or creation date does not affect it. The hash can therefore be used to confirm that a file has not been corrupted or modified, and also to look up existing analysis results on a query platform.
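If none of the tools below is at hand, the hash can also be calculated with Python's standard `hashlib` module. This is a minimal sketch: the function name `file_hashes` and the choice of MD5 plus SHA-256 are assumptions for illustration (query platforms generally accept either).

```python
import hashlib

def file_hashes(path, chunk_size=1 << 20):
    """Return the MD5 and SHA-256 hex digests of the file at path."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large samples do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

The file is only read, never executed, so this is safe to run on a suspicious sample. The resulting hex string is what you paste into the search box of the query platform.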

Calculation tools:

Hasher Pro: http://www.den4b.com/

HashOnClick: https://www.2brightsparks.com

Hash Generator Pro: http://insili.co.uk/

MD5 File Hasher Pro: http://www.digital-tronic.com/md5-file-hasher/

Advanced Hash Calculator: http://www.filesweb.com/

VirusTotal (query platform): https://www.virustotal.com/gui/home/search

Virtue rises one foot, the devil rises ten

Viruses often use packing and obfuscation techniques to evade static analysis.