Skip to main content

Checksums and hashes

By 26 February 2019 January 21st, 2021 Blog
checksums

Checksums and hashes are related to each other and are very frequently used in IT and in various places; from databases to file systems and password databases.

Hashing algorithms, which represent a special form of checksum, are also used to create a blockchain. In order to understand how a blockchain works, having knowledge about hashing algorithms is very useful.

But to understand what a hash is, we start with a checksum.

Imagine you have a slow and unreliable network connection and you want to send an image to someone. The larger the image, the greater the chance that something goes wrong during the transmission. One solution is to send the image two or more times. If the recipient has two identical images, then there is a good chance that it is the right image.

However, this is a very laborious and time-consuming process. Suppose there is a mathematical operation that can convert the image into something much smaller. This could then be sent after the image in order to determine whether the image was properly received.

This mathematical operation, also referred to as a checksum algorithm, should, however, comply with a number of requirements:

1) IT MUST BE POSSIBLE TO CALCULATE IT IN AN EASY, FAST AND RELIABLE MANNER

The whole point of the example was to save time and resources. In addition, of course, it is very reassuring when the checksum always provides the same value for the same data.

2) THE CHECKSUM MUST CHANGE IF EVEN ONE BIT OF THE IMAGE IS DIFFERENT

Since the checksum contains less data than the image, it is inescapable that there is another image that has the same checksum. We call this a collision. However, requirements can be imposed on the algorithm to ensure that the chance that a collision leads us to an incorrect conclusion is as small as possible and as arbitrary as possible.

  • All possible checksums of all possible images are distributed homogeneously, with every checksum occurring equally often. This way, the statistical chance that we will be faced with a collision is as small as possible.
  • Theoretically speaking, we could send a very small image that is even smaller than the checksum; in such a case, a collision should never occur relative to different images of the same size;
  • A minimal change to the image must always result in a different checksum, otherwise we would never be able to determine whether the image was transmitted entirely correctly;

 

3) PREDICTABLE OR FIXED LENGTH

The checksum must have a predictable / fixed length. This way, the recipient can determine that the entire checksum has been received. Additionally, from a technical perspective it is easier to work with something with a fixed length than with something with a variable length.

See the image above for an example of a simple checksum algorithm.

If 77 is sent after the word “Nederland”, the recipient can determine whether the reception is correct. But is it a strong checksum algorithm?

  • It is certainly easy and fast;

  • If one letter changes, then the algorithm results in a different number, but if a second letter changes, the result might be 77 again. For instance, the word “Neaerlandg” will in this case also have a hash value of 77. In fact, the hashes are not divided homogeneously. A hash value of 0 will never occur, and neither will 1,000. Many hashes will have a value between 50 and 100;

  • This method certainly does not have a fixed length. If we restrict ourselves to country names, then the length will vary between two and three digits. For entire pieces of text, it can be anything;

The shown checksum algorithm is, consequently, not very strong, but it does illustrate what a checksum essentially is.

While a checksum works well as a measure against unintentional transmission and storage errors, it does not necessarily work well as a means against intentional changes to the data.

This is, for instance, relevant when the image that is sent will be placed on a passport. It has to be your photo and not that of someone else who intends to travel on your passport.

To accomplish this, there is a special class of checksum algorithms called the hash. Our example of a hash function is a very bad one; it is possible to approximately determine the number of letters from a hash value of 77. This makes it very easy to create a word that has the same number of letters and the same hash.

That is why we need to add an additional requirement for a hashing algorithm:

4) DIFFICULT TO TRACE BACK

As little information as possible should be discernible about the original data on the basis of the hash value. It should be virtually impossible to, in a short period of time, calculate alternatives (collisions) for the original data from a given hash value.

Imagine you have a set of transfers ready for the bank. You then secure them by calculating the hash and sending it along with the transfers. This way, the bank can determine that they received the information in good order and that nobody has modified the data.

Usually, hash algorithms are public. An attacker could therefore try to find a combination of transfers that would result in the same hash (a collision). If he then requests the relevant account numbers, he could steal from you.

Hash values typically have a length of 128 or 256 bits. With a good algorithm, an attacker needs to try, on average, a 1 with 39 zeros times before he finds a combination with the same hash value. With 1,000 attempts per second, this will still take a 1 with 27 zeros in years. To compare, the age of the universe is something with 10 zeros in years.

When hashing, an additional measure can be taken to make cracking the hash more difficult. The hash can be “salted”. Extra data is added to the data that will be hashed in the form of a password, secret or otherwise. We call this a “salted hash”. In our next blog, we will explain how the “salt” assists in the protection of the data.

Another area of application for hashes is the storage of passwords and pin codes. By storing the hash of a password or pin code, an application can verify that the user has entered the correct password without storing the password itself (by “hashing” the entry again).

If the database with hashes is accidentally leaked for some reason, the passwords are not immediately up for grabs. Hopefully, the owner of the site did use long and salted hashes.

 

STAY TUNED FOR THE NEXT PART OF THIS BLOG SERIES: CRACKING THE HASH

TOPICS DISCUSSED

  • Algorithm: Calculation method.

  • Checksum: Check digit, calculated over a block of data for the purpose of establishing an error-free transfer of data. An example of a checksum algorithm is the Cyclic Redundancy Check (CRC).

  • Collision: When two different blocks of data produce the same hash or checksum.

  • Hash: Check digit, calculated over a block of data for the purpose of establishing data integrity in a secure manner. Examples of secure hash algorithms: SHA-2 and SHA-3. Hash algorithms that should be avoided: MD5, SHA-0 and SHA-1.

  • Salted hash: The addition of extra data to the data block over which the hash is calculated, with the objective to make it more difficult to trace the hash back to the original data.