October 27, 2021
Constructing Custom Word Lists and Automating Dictionary Attacks on Password Protected Zip Files in Linux
Encryption is often used by corporations to protect sensitive or personal data. However, it can also be used to prevent access to key evidence. How, then, can you take steps to access this information in its encrypted state?


An individual attended an NHS walk-in centre to speak to a physician and divulged a criminal offence, which was reported to the police. Subsequently, a large quantity of digital equipment was seized and sent to the digital forensics laboratory for analysis. During the analysis, a large number of unique password protected zip files were located, and the suspect was refusing to cooperate with the investigation. The contents of these files were therefore of interest to the investigation.


There were 1,017 unique password protected zip files for which the passwords were not known. Compounding this problem, the user was an interpreter and was therefore conversant in several foreign languages. Some passwords that were retrievable from the systems showed the use of Japanese words written in the Latin alphabet. It was therefore unlikely that commonly available word lists would be effective. Whilst some circumstantial evidence had been obtained, gaining access to these containers would prove key.


When zip files are password protected, their contents are encrypted. Commonly used algorithms are AES-256 and ZipCrypto. Certain information about the files within them, such as the file name, size and last modified date, is available without the password being known.
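This unencrypted metadata can be seen by listing an archive without supplying a password. The following is a minimal sketch, assuming Info-ZIP's unzip is installed; the function name and the archive name evidence.zip are hypothetical and used only for illustration.

```bash
# List entry names, sizes and timestamps of an archive without a password.
# With ZipCrypto and WinZip-style AES the zip central directory is normally
# stored unencrypted, so the listing works even though extraction would not.
# "evidence.zip" below is a hypothetical file name.
list_zip_entries() {
    unzip -l "$1"
}

# Example: list_zip_entries evidence.zip
```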

The tool fcrackzip (Free/Fast Zip Password Cracker) is a Linux tool by Marc Lehmann which is widely used to automate attacks against password protected zip files. It can crack password protected zip files with brute force or dictionary based attacks, and can optionally test the results with unzip. Initial use with large, commonly available word lists such as “realhuman” and “rockyou” did not yield any results.

After the suspect’s machine was virtualised as a guest machine, a number of passwords were recovered from the saved passwords within his Firefox password manager. With these, some of the zip files could be opened, but not all. The description of fcrackzip from the man page is as follows:
“fcrackzip searches each zipfile given for encrypted files and tries to guess the password. All files must be encrypted with the same password, the more files you provide, the better.”

The zip files had many different passwords, so the process would need to be automated. The solution to this problem was therefore as follows:
Construct a custom word list specific to the suspect, and
Automate the extraction of the 1,017 zip files using a wrapper for fcrackzip to handle different passwords.

Method: Building the Wordlists

Once the disks had been imaged and pre-processed in X-Ways Forensics, the zip files and other artefacts were extracted. This was conducted on forensic workstations running Windows; however, the majority of the work in this case was completed in Linux, which makes manipulating text files easy.

pagefile.sys and hiberfil.sys

Windows uses the pagefile.sys file as virtual memory when it has no random access memory (RAM) left to utilise. When a Windows computer enters hibernate mode, it writes the contents of memory out to the hard drive as the hiberfil.sys file before powering down. Both files therefore contain data from the computer’s memory at certain points in time. Passwords can be stored in RAM unencrypted. Unfortunately, no RAM dump from any of the computers was obtained prior to submission.

strings is a Linux tool which extracts strings of printable characters from files. Its output formed part of the dictionary. The following command was run on a Linux system to process the two files and write every run of 5 or more printable characters to a text file.

strings -n 5 pagefile.sys hiberfil.sys > pf_hf.txt

Firefox Password Manager

Firefox was the primary browser used by the suspect. Although he also used the Tor Browser, which is based on Firefox, it is modified so that passwords cannot be stored. Using virtualisation software, his disk image was started as a guest virtual machine and Firefox was launched. Within Firefox on the guest machine, Settings > Privacy & Security > Saved Logins… reveals all currently saved passwords, of which there were over 40 unique passwords. At least one of these was also the password for at least one of the zip files. All these saved passwords were exported to a text file to form part of the dictionary.

Internet History and Interests

Whilst the suspect spoke several languages, much of his interest was in Japanese culture and, from examining his computer, most of his translation work was in Japanese. Many of the passwords that had been seen so far were also in Japanese but in Latin characters.

html2text is a Linux utility which reads HTML documents from the input files, formats each of them into a stream of plain text characters, and writes the result to standard output. Whilst it performs a similar function to strings, it also removes HTML tags, which are otherwise printable characters. A bookmarked website and a large sample of pages from the suspect’s recovered internet history were visited and saved.

html2text ./*htm* > webpage.txt

This output was unsatisfactory for word list generation, however: although tags were removed, each line still contained multiple words. Although it would be easy to split the words by white space in bash, it is a laborious task for multiple web pages, which must be saved first. This could be automated, but a solution already exists.
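For reference, splitting text into one word per line by white space could be sketched as below. This is an illustration only, not a step taken in the case; the function name, the output file name and the default minimum length of 4 are assumptions.

```bash
# Turn free text (for example html2text output) into one candidate word per line.
# $1 is the input text file; $2 is a minimum word length (hypothetical default 4).
text_to_wordlist() {
    min_len=${2:-4}
    # Replace every run of white space with a single newline,
    # keep only lines that are long enough, then sort and de-duplicate.
    tr -s '[:space:]' '\n' < "$1" |
        awk -v n="$min_len" 'length($0) >= n' |
        sort -u
}

# Example: text_to_wordlist webpage.txt > webpage_words.txt
```

The manual part that remains is saving every page first, which is what the tool described next removes.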

CeWL is a Ruby application which “spiders” a given URL to a specified depth, optionally following external links, and returns a list of words which can then be used with password crackers. Every bookmarked website and a large sample of sites from the suspect’s recovered internet history were visited with CeWL. The example below is for one such website.

ruby cewl.rb -m 4 -d 2 -w cewl.txt -v

This command runs in verbose mode and writes to cewl.txt all words of 4 or more characters (not including HTML) from the website, following links within the same site to a depth of 2. As some sites, like this news site, are large, this “spidering” can take some time. The task of visiting the sites and building these word lists took one week.

Personal Information

CUPP (Common User Passwords Profiler) is a Python script which automates word list generation when provided with specific information about an individual. When provided with information such as first name, surname, date of birth, nickname, partner’s name, pet names, and keywords, it will generate multiple combinations of these with random numbers and “leet speak”. It also has many language specific word lists, including a 115,600 word Japanese list, which was imported for recombination.

python -i


crunch can create a word list based on criteria you specify. The output from crunch can be sent to the screen, to a file, or to another program. The usefulness of crunch in this case was limited, as the information analysed so far indicated long and complex passwords. crunch was still used, however, to generate a word list of shorter words, in case some of the many zip files used one. The cost of doing so in time and computer resources is low as long as the number of possible combinations is low. Examples of the resulting word list sizes are given below.

The size in bytes of a word list created by crunch is approximately (x ^ y) * (y + 1), where x is the number of characters in the character set and y is the length of the password; the extra byte per word is the line ending. For example, all 6 character combinations of lower case, upper case and numeric characters produce a file of the following size:
((26 + 26 + 10) ^ 6) x (6 + 1)
56,800,235,584 x 7 = 397,601,649,088 bytes

In reality, however, crunch creates slightly larger files. The following file size estimates show that large word lists generated by crunch become infeasible due to storage limitations:
1 to 4 characters 71 MB,
1 to 5 characters 5 GB,
1 to 6 characters 375 GB,
1 to 7 characters 25 TB.
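
These estimates can be reproduced with shell arithmetic. The following is a minimal sketch of the (x ^ y) * (y + 1) formula, assuming a 62 symbol character set and one line ending byte per word; the function names are illustrative only.

```bash
# Estimated size in bytes of a word list holding every word of exactly
# $2 characters drawn from a character set of $1 symbols:
# (charset ^ length) words, each taking (length + 1) bytes with its newline.
wordlist_bytes() {
    wb_words=1
    wb_i=0
    while [ "$wb_i" -lt "$2" ]; do
        wb_words=$(( wb_words * $1 ))
        wb_i=$(( wb_i + 1 ))
    done
    echo $(( wb_words * ($2 + 1) ))
}

# Cumulative estimate for all word lengths from 1 up to $2.
wordlist_bytes_upto() {
    total=0
    len=1
    while [ "$len" -le "$2" ]; do
        total=$(( total + $(wordlist_bytes "$1" "$len") ))
        len=$(( len + 1 ))
    done
    echo "$total"
}

wordlist_bytes 62 6        # prints 397601649088
wordlist_bytes_upto 62 4   # prints 74846648, about 71 MB
```

Dividing the cumulative totals by 1024 repeatedly gives the figures listed above, which is why each additional character multiplies the storage needed by roughly the size of the character set.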

Whilst these generated word lists can be piped directly to other commands, meaning storage is not an issue, the sheer number of possibilities means that processing will take a long time, and without (and even with) specialist optimised hardware, guessing the correct result is unlikely. As such, the crunch word list was only included to cover passwords from 1 to 6 characters in length, using all alphanumeric upper and lower case permutations, as below.

crunch 1 6
-o crunch.txt

Other Sources

The suspect provided a list of 12 possible passwords to assist.  Although none of these worked on the zip files, it was possible that they were similar to the passwords for the zip files, so they were included in a text file.

A small book was recovered from the suspect’s home address that listed various account details and passwords. Although none of these allowed access to the zip files, it was possible that they were similar to the passwords for the zip files, and as such they were also included in a text file for later recombination if required.

Combining All Word Lists

All the word list text files were merged into one dictionary for use with fcrackzip. This made the process simpler and also removed duplicate entries, such as those in both crunch.txt (all 1 to 6 character alphanumeric words) and the CeWL output (which was set to words of 4 or more characters). By removing duplicates, the length of time it would take to process the zip files would decrease.

sort ./* | uniq > dictionary.txt
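
A slightly more defensive variant of this merge is sketched below. It is an illustration under stated assumptions, not the command used in the case: the function name and directory layout are hypothetical, and the output file is written outside the input directory so that it is never read back in as a source list on a later run.

```bash
# Merge every *.txt word list in a directory into one de-duplicated dictionary.
# $1 is the directory of word lists, $2 is the output file (kept elsewhere).
merge_wordlists() {
    src_dir=$1
    out_file=$2
    # awk strips carriage returns left by Windows tools and drops empty lines;
    # LC_ALL=C makes sort compare bytes, independent of the system locale.
    awk '{ sub(/\r$/, "") } length($0) > 0' "$src_dir"/*.txt |
        LC_ALL=C sort -u > "$out_file"
}

# Example: merge_wordlists ./wordlists ./dictionary.txt
```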

Method: Building the Wrapper

As previously described, fcrackzip can only process multiple zip files if they all have the same password. As they do not in this case, a bash script was created to automate this process. There were many ways to solve this problem, but this one was mine.

The core of the script would be simple and efficient, invoking fcrackzip to do the extractions of the zip files.  Robust error checking and counters would also be implemented for a summary at the conclusion of processing.
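
A minimal sketch of how such a wrapper could look is given below. It is an illustration under stated assumptions rather than the script used in the case: the function name, directory layout and the passwords.tsv and failed.txt file names are hypothetical, and it assumes fcrackzip reports a hit on a line of the form "PASSWORD FOUND!!!!: pw == <password>".

```bash
# Try one dictionary against each zip file in turn and extract on success.
# $1 = directory of zip files, $2 = merged dictionary, $3 = output directory.
crack_all() {
    zip_dir=$1
    dict=$2
    out_dir=$3
    cracked=0
    failed=0
    mkdir -p "$out_dir" || return 1
    for archive in "$zip_dir"/*.zip; do
        [ -f "$archive" ] || continue
        # -u verifies each guess with unzip, -D selects dictionary mode,
        # -p names the word list file.
        pw=$(fcrackzip -u -D -p "$dict" "$archive" 2>/dev/null |
             sed -n 's/^PASSWORD FOUND[!]*: pw == //p' | head -n 1)
        name=$(basename "$archive" .zip)
        if [ -n "$pw" ] && mkdir -p "$out_dir/$name" &&
           unzip -q -o -P "$pw" "$archive" -d "$out_dir/$name"; then
            cracked=$(( cracked + 1 ))
            printf '%s\t%s\n' "$archive" "$pw" >> "$out_dir/passwords.tsv"
        else
            failed=$(( failed + 1 ))
            printf '%s\n' "$archive" >> "$out_dir/failed.txt"
        fi
    done
    # Summary counters for the end of the run.
    echo "cracked=$cracked failed=$failed"
}

# Example: crack_all ./zips ./dictionary.txt ./extracted
```

Running fcrackzip once per archive avoids its requirement that all files given together share one password, at the cost of re-reading the dictionary for every file.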


Of the 1,017 unique zip files, all but 39 (96%) were successfully extracted. Within the zip files was a selection of files which proved the offence beyond reasonable doubt, and the suspect entered a guilty plea.

In order to analyse the success of the differing methods, I searched for each password within each of the separate word lists that were created before they were combined into a dictionary. For example, to search for the password “gekokujo” within the lists, a simple grep search was conducted.

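A per-list search of this kind could be sketched as follows, assuming the separate word lists are kept as individual text files. The function name and the paths in the example are hypothetical.

```bash
# Report which word lists contain a given password as a whole line.
# -l prints only the names of matching files, -F treats the password as a
# fixed string rather than a pattern, -x requires the whole line to match.
find_password_sources() {
    password=$1
    shift
    grep -l -F -x -e "$password" -- "$@"
}

# Example: find_password_sources gekokujo ./wordlists/*.txt
```

Matching whole lines matters here, because a list containing only "gekokujo99" should not be credited with the password "gekokujo".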

There was a large degree of overlap between the different sources. Although pagefile.sys and hiberfil.sys had massively differing contents, they both contained the same 5 passwords and as such have been combined in the results table. The suspect provided one password which was also generated by crunch. Both the book and the Common User Passwords Profiler (CUPP) produced passwords, all of which were also saved in Firefox. Additionally, CUPP generated two passwords which were also in crunch. The downloaded Japanese word list generated 4 passwords, 2 of which were also recovered by CeWL’s “spidering”. These two sources have been combined in the results table, although CeWL’s “spidering” alone recovered only 14 passwords.

Between them, Firefox, CeWL, pagefile.sys / hiberfil.sys and crunch delivered all 978 correct passwords in the dictionary. Duplicates of these were found in the suspect’s book and in the list provided by the suspect themselves.


Throughout the process of building the custom word list, I often dismissed crunch. The information I knew about the suspect and the data I had already gathered from his devices led me to believe that it would not generate any useful words. Although it generated only 2, it did generate some. Despite the suspect’s often complex passwords, some were short enough to fall within the 1 to 6 character range that I specified for crunch.

CeWL exceeded my expectations. Although slow to run, especially across large sites, it generated a great word list specific to the suspect. Firefox’s saved passwords yielded more useful passwords for the zip files, but that is almost to be expected: it was, after all, a saved list of passwords, and most users do re-use them.

Low tech solutions are often overlooked when investigating high tech crime. The importance of the suspect’s book, which was located next to his computer table, was initially overlooked. Although all 4 useful passwords within it were already saved within the suspect’s Firefox browser, it is easy to imagine a situation where a more security conscious suspect did not save any passwords locally.

The downloadable Japanese word list also got results that neither CeWL’s “spidering” nor any other source did. There are a large number of specific word lists available online, and when several are combined, an effective attack can be made. Whilst only one external word list was used in this case, if I had achieved fewer results, I would have tried more.

Time and resources permitting, tailoring a word list for dictionary attacks can have great benefits in gaining access to encrypted files.

If you have any questions about this blog, or would like to discuss how defensive services could help your business, please do get in touch.
