Drive to improve flash reliability

Solid-state drive (SSD) flash memory storage devices have accelerated laptops and server-based computing. As organisations embark on digital transformation and look to use techniques such as artificial intelligence (AI) to gain greater insight into the data they collect, there is demand not only for more storage, but also for increasingly fast data access.

For example, Total Gas & Power sees an opportunity to deploy flash-based storage to accelerate on-premise applications that need to integrate with cloud services. Total Gas & Power has deployed Nutanix hyperconverged infrastructure, and has also been replacing its NetApp filer with a flash-based storage array.

“We have got into integration in a big way,” says Dominic Maidment, technology architect at Total Gas & Power. The company is using MuleSoft to manage its application programming interfaces and is looking at how flash can power hybrid integration. “With on-premise data assets, access has to be as fast and as light as possible,” says Maidment. 

Revolution of storage 

Enterprise storage used to be synonymous with vast arrays of hard disk drives, each built from platters spinning at thousands of revolutions per minute, with read/write heads floating just a few micrometres above them. If a head crashes into a platter, data loss can occur. Flash SSDs, which use Nand memory chips, got rid of the spinning disks, so they should be more reliable. Right?

The problem is that although flash-based storage has extremely high levels of reliability, it is still prone to data loss and requires a host of measures to keep data uncorrupted. In flash SSDs, data is organised into blocks of memory: it is written to the device in 4KB pages, but can only be erased in 256KB blocks. So far, so good. But back in 2008, a study by the Jet Propulsion Laboratory (JPL) found that flash storage produced higher error rates when the same block of memory was repeatedly cycled, with data being written to it, then erased.
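
To make that asymmetry concrete, here is a minimal sketch in Python of a single flash block. The page and block sizes match those quoted above; the 3,000-cycle endurance budget is an assumption made for the example. Data is programmed a page at a time, but the whole block must be erased before any page can be rewritten, and each erase consumes one program/erase (P/E) cycle.

```python
# Illustrative model of a single Nand flash block (hypothetical endurance figure).
PAGE_SIZE = 4 * 1024          # data is programmed one 4KB page at a time
BLOCK_SIZE = 256 * 1024       # but erased only in whole 256KB blocks
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE
ENDURANCE = 3_000             # assumed program/erase (P/E) cycle budget

class FlashBlock:
    def __init__(self):
        self.erase_count = 0
        self.programmed = [False] * PAGES_PER_BLOCK

    def program_page(self, page_index):
        # A page can only be written once per erase cycle.
        if self.programmed[page_index]:
            raise IOError("page already programmed; erase the whole block first")
        self.programmed[page_index] = True

    def erase(self):
        # Each erase wears the cells; past the budget, errors become likely.
        if self.erase_count >= ENDURANCE:
            raise IOError("block worn out: P/E cycle budget exhausted")
        self.erase_count += 1
        self.programmed = [False] * PAGES_PER_BLOCK
```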

Ben Gitenstein, vice-president of product management at data storage company Qumulo, says this means flash drive sectors become unusable after a certain number of overwrites. As a result, flash-based storage devices can be written to only so many times before “wearing out”. 

The 2008 JPL study found that many systems implement wear-levelling, in which the system moves often-changed data from one block to another to prevent any one block from getting far more cycles than others. 
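
A minimal sketch of the wear-levelling idea, assuming the controller simply tracks per-block erase counts and steers new writes to the least-worn free block (real controllers use considerably more sophisticated static and dynamic policies):

```python
# Illustrative wear-levelling allocator: steer writes to the least-worn free block.
def pick_block_for_write(blocks):
    """blocks: iterable of objects with .erase_count and .is_free attributes."""
    free = [b for b in blocks if b.is_free]
    if not free:
        raise IOError("no free blocks: garbage collection needed first")
    # Choosing the block with the fewest erases spreads wear evenly, so no
    # single block accumulates far more P/E cycles than the others.
    return min(free, key=lambda b: b.erase_count)
```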

Beyond dynamic management of flash drive data blocks, the IT industry has tended to over-provision the number of flash drives required to allow for dynamic reallocation of bad sectors. In large datacentre facilities, routine maintenance requires replacing drives on a regular basis to reduce errors and bad chips. 
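
Within a single drive, that reallocation can be pictured as a small translation table: when a block goes bad, the controller points its logical address at one of the spare, over-provisioned blocks instead. The sketch below is illustrative only, not any vendor's actual firmware.

```python
# Illustrative bad-block remapping backed by an over-provisioned spare pool.
class RemapTable:
    def __init__(self, spare_blocks):
        self.spares = list(spare_blocks)   # over-provisioned blocks held in reserve
        self.remap = {}                    # retired logical block -> spare block

    def resolve(self, logical_block):
        # Reads and writes are transparently redirected if the block was retired.
        return self.remap.get(logical_block, logical_block)

    def retire(self, logical_block):
        # When a block goes bad, substitute a spare; once spares run out,
        # the drive can no longer hide failures and should be replaced.
        if not self.spares:
            raise IOError("spare pool exhausted: replace the drive")
        self.remap[logical_block] = self.spares.pop()
```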

In 2016, researchers looking at the use of flash drives in datacentres warned that during the majority of days that a flash drive is operational (drive days), at least one correctable error will occur.  

However, the researchers reported that other types of transparent errors, which the drive can mask from the user, are rare compared with non-transparent errors. 

The results of the study into the reliability of flash storage in a datacentre environment were presented at the 14th Usenix Conference on File and Storage Technologies (Fast ’16) in Santa Clara by Bianca Schroeder, associate professor at the University of Toronto, and Google engineers Raghav Lagisetty and Arif Merchant. 

The study, which assessed six years of flash storage reliability in Google’s datacentres, reported that compared with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field. However, they have a higher rate of uncorrectable errors. 

The research looked at reads and writes of the same data across a range of flash drives, covering different generations of flash technology. All the systems tested used the same error correction code. The study found the most common errors were read errors that could not be resolved even after retrying the operation. These are called non-transparent errors. 

According to the research paper, write errors rarely turn into non-transparent errors. “Depending on the model, 1.5-2.5% of drives and one to four out of 10,000 drive days experience a final write error – a failed write operation that did not succeed even after retries,” said the paper.

The researchers surmised that the difference in the frequency of final read and final write errors is probably due to the fact that a failed write will be retried at other drive locations. The report stated: “So while a failed read might be caused by only a few unreliable cells on the page to be read, a final write error indicates a larger-scale hardware problem.” 

The study found that up to 80% of drives developed bad data blocks and 2-7% of the drives tested developed bad Nand memory chips during the first four years of their life. These, according to the researchers, are drives that, without mechanisms for mapping out bad chips, would require repairs or to be returned to the manufacturer. 

When they looked at the symptoms that led to the chip being marked as failed, across all models, about two-thirds of bad chips were declared bad after reaching the 5% threshold on bad blocks. Interestingly, at the time of the study, the researchers noted that the bad chips that saw more than 5% of their blocks fail were chips that had actually violated manufacturer specifications. 

The study’s conclusion was that between 20% and 63% of drives experience at least one uncorrectable error during their first four years in the field, making uncorrectable errors the most common non-transparent error in these drives. 

A year before the study on Google’s datacentres, researchers from Carnegie Mellon University and Facebook looked into the reliability of flash-based SSDs. This study reported that the more data that is transmitted over the computer’s PCI-Express (PCIe) bus, to and from the flash storage device, the greater the power used on the bus, and the higher the SSD temperature. The study found that higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques that reduce data rates. 
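
The throttling the researchers describe can be sketched as a simple feedback rule: when the drive temperature crosses a threshold, the controller caps the transfer rate until the device cools down. All the figures below are invented for illustration.

```python
# Illustrative thermal throttling: cap throughput when the drive runs hot.
MAX_RATE_MBPS = 3_000        # assumed full-speed transfer rate
THROTTLE_TEMP_C = 70         # assumed temperature threshold
THROTTLED_RATE_MBPS = 500    # assumed reduced rate while hot

def allowed_rate(temp_c):
    """Return the transfer rate (MB/s) the controller permits at this temperature."""
    # Cutting the data rate lowers the power drawn over the PCIe bus, which
    # lets the SSD cool and keeps temperature-related failure rates down.
    return THROTTLED_RATE_MBPS if temp_c >= THROTTLE_TEMP_C else MAX_RATE_MBPS
```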

More recently, the 2018 paper The errors in flash-memory-based solid-state drives: analysis, mitigation, and recovery from Seagate Technology, Carnegie Mellon University and ETH Zurich, highlighted the fact that higher storage density leads to higher error rates and failures.  

“As the underlying Nand flash memory within SSDs scales to increase storage density, we find that the rate at which raw bit errors occur in the memory increases significantly, which in turn reduces the lifetime of the SSD,” the researchers warned. 

Undoubtedly, SSDs are more reliable than traditional enterprise disk arrays. “Traditionally, if you wanted to replace an enterprise hard drive in a disk array, it required a £5,000-a-day engineer to come on site to remap the array with the new drive,” says Rob Tribe, senior systems engineering director at Nutanix. 

He says the firmware in SSD drives constantly looks at the write cycle, and when it finds failures, it fences off blocks of memory that are producing too many errors. As IT becomes more automated, Tribe acknowledges that the metrics the SSD firmware provides are not yet being fully utilised by operating system software.  
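
That fencing-off can be thought of as a per-block error counter with a retirement threshold, combined with the 5% per-chip rule the Google study describes. The per-block error limit below is an assumption made for the example.

```python
# Illustrative error-based retirement: fence off blocks that keep failing,
# and declare the whole chip bad once 5% of its blocks have been retired.
BLOCK_ERROR_LIMIT = 10          # assumed per-block error threshold
BAD_BLOCK_CHIP_FRACTION = 0.05  # 5% threshold reported in the Google study

def record_block_error(error_counts, bad_blocks, block):
    error_counts[block] = error_counts.get(block, 0) + 1
    if error_counts[block] >= BLOCK_ERROR_LIMIT:
        bad_blocks.add(block)   # firmware stops placing data on this block

def chip_is_bad(bad_blocks, blocks_per_chip):
    return len(bad_blocks) / blocks_per_chip >= BAD_BLOCK_CHIP_FRACTION
```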

Reliability is measured by failed input/output operations – reads and writes to the drive. But in future, SSD firmware could be used to carry out preventative maintenance, in which the system itself closely tracks and monitors how the SSD wears.
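
One way such preventative maintenance could look in practice is to sample a drive's self-reported wear attributes periodically and flag it before it wears out. The sketch below shells out to the smartctl utility from smartmontools and looks for a wear-related SMART attribute; attribute names vary by vendor, so both the names and the replacement threshold here are assumptions.

```python
# Illustrative wear check via smartctl (requires smartmontools to be installed).
import subprocess

# Vendor-specific attribute names; the ones a given drive reports will differ.
WEAR_ATTRIBUTES = {"Wear_Leveling_Count", "Media_Wearout_Indicator", "Percent_Lifetime_Remain"}
REPLACE_BELOW = 10  # assumed threshold on the normalised health value

def drive_needs_attention(device="/dev/sda"):
    """Return True if a wear attribute suggests the drive is near end of life."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # SMART attribute rows look like: ID# NAME FLAG VALUE WORST THRESH ...
        if len(fields) > 3 and fields[1] in WEAR_ATTRIBUTES:
            return int(fields[3]) < REPLACE_BELOW
    return False  # no wear attribute found; nothing to report
```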
