I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided I wanted to make sure the file system was aligned on an erase block boundary. This is generally considered to be a Very Good Thing to do for most SSD’s available today, although there’s some question about how important this really is for Intel SSD’s — more on that in a moment.
It turns out this is much more difficult than you might first think — most of Linux’s storage stack is not set up well to worry about alignment of partitions and logical volumes. This is surprising, because it’s useful for many things other than just SSD’s. This kind of alignment is important if you are using any kind of hardware or software RAID, for example, especially RAID 5, because if writes are done on stripe boundaries, it can avoid a read-modify-write overhead. In addition, the hard drive industry is planning on moving to 4096 byte sectors instead of the way-too-small 512 byte sectors at some point in the future. Linux’s default partition geometry of 255 heads and 63 sectors/track means that there are 16065 (512 byte) sectors per cylinder. Since 16065 is not divisible by 8, partitions that begin and end on cylinder boundaries will almost never be 4k aligned. The initial round of 4k sector disks will emulate 512 byte sector disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each singleton 4k file system write, and that would be unfortunate.
Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track. This results in a cylinder size of 15120 sectors, which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned. So this is one place where Vista is ahead of Linux. Unfortunately, the default of 255 heads and 63 sectors is hard-coded in many places in the kernel, in the SCSI stack, and in various partitioning programs, so fixing this will require changes in many places.
However, with SSD’s (remember SSD’s? This is a blog post about SSD’s…) you need to align partitions on at least 128k boundaries for maximum efficiency. The best way to do this that I’ve found is to use 224 (32*7) heads and 56 (8*7) sectors/track. This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k. You can do this by starting fdisk with the following options when first partitioning the SSD:
# fdisk -H 224 -S 56 /dev/sdb
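To spell out the arithmetic behind these geometries (one sector = 512 bytes, k = 1024):

255 heads x 63 sectors/track = 16065 sectors/cylinder (not a multiple of 8, so cylinder boundaries are not 4k aligned)
240 heads x 63 sectors/track = 15120 sectors/cylinder (a multiple of 8, so cylinder boundaries are 4k aligned)
224 heads x 56 sectors/track = 12544 sectors/cylinder = 49 x 256 sectors = 49 x 128k (every cylinder boundary is 128k aligned)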
The first partition will only be aligned on a 4k boundary, since in order to be compatible with MS-DOS, the first partition starts on track 1 instead of track 0; I didn’t worry too much about that, since I tend to use the first partition for /boot, which doesn’t get modified much. You can go into expert mode with fdisk and force the first partition to begin on a 128k alignment, but many Linux partition tools will complain about potential compatibility problems (obsolete warnings, since systems that would actually have trouble booting from such a layout haven’t been made in about ten years). I didn’t need to do that, so I didn’t worry about it.
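For the terminally curious, here is a sketch of one way to do that, using fdisk in sector mode (the exact prompts vary between fdisk versions, and the device name is just an example):

# fdisk -u -H 224 -S 56 /dev/sdb

With -u, partition start and end positions are entered in 512-byte sectors, so the first partition can be started at sector 256 (256 x 512 bytes = 128k) instead of the first track; fdisk will grumble about the partition not starting on a cylinder boundary, which can be ignored.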
So I created a 1 gigabyte /boot partition as /dev/sdb1, and allocated the rest of the SSD for use by LVM as /dev/sdb2. And that’s where I ran into my next problem. LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:
# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created
Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:
# pvs /dev/sdb2 -o+pe_start
  PV         VG   Fmt  Attr PSize  PFree  1st PE
  /dev/sdb2       lvm2 --   73.52G 73.52G 256.00K
If you use a metadata size of 256k, the first PE will be at 320k instead of 256k. There really ought to be a --pe-align option to pvcreate, which would be far more user-friendly, but we have to work with the tools that we have. Maybe in the next version of the LVM support tools….
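Once you start carving out logical volumes, you can also double-check where each one actually starts by looking at the device-mapper table; this is just a sketch, using the same volume group and logical volume names as the mke2fs example below:

# lvcreate -L 20G -n root ssd
# dmsetup table ssd-root

The last field of the "linear" line printed by dmsetup is the starting offset, in 512-byte sectors, within /dev/sdb2; for 128k alignment it should be a multiple of 256 (with the 250k metadatasize above, it works out to 512, i.e. 256k).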
Once you do this, we’re almost done. The last thing to do is to create the file system. As it turns out, if you are using ext4, there is a way to tell the file system that it should try to align files so they match up with the RAID stripe width. (These techniques can be used for RAID disks as well.) If your SSD has a 128k erase block size, and you are creating the file system with the default 4k block size, the stripe width is 128k / 4k = 32 blocks, so you just have to specify that stripe width when you create the file system, like so:
# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/ssd/root
(The resize=500G limits the number of blocks reserved for resizing this file system, so that the file system is guaranteed to be able to grow to 500G via online resize. The default is 1000 times the initial file system size, which is often far too big to be reasonable. Realistically, the file system I am creating is going to be used for a desktop device, and I don’t foresee needing to resize it beyond 500G, so this saves about 50 megabytes or so. Not a huge deal, but “waste not, want not”, as the saying goes.)
With e2fsprogs 1.41.4, the journal will be 128k aligned, as will the start of the file system, and with the stripe-width specified, the ext4 allocator will try to align block writes to the stripe width where that makes sense. So this is as good as it gets without kernel changes to make the block and inode allocators more SSD aware, something which I hope to have a chance to look at.
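As a quick sanity check that the stripe width actually made it into the superblock, you can do something like this (a sketch, using the same device name as above):

# dumpe2fs -h /dev/ssd/root | grep -i stripe

which should report a RAID stripe width of 32.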
All of this being said, it’s time to revisit this question — is all of this needed for a “smart”, “better by design” next-generation SSD such as Intel’s? Aligning your file system on an erase block boundary is critical on first generation SSD’s, but the Intel X25-M is supposed to have smarter algorithms that allow it to reduce the effect of write-amplification. The details are a little bit vague, but presumably there is a mapping table which maps sectors (at some internal sector size — we don’t know for sure whether it’s 512 bytes or some larger size) to individual erase blocks. If the file system sends a series of 4k writes for file system blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99, followed by a barrier operation, a traditional SSD might do read/modify/write on four 128k erase blocks — one covering the blocks 0-31, another for the blocks 32-63, and so on. However, the Intel SSD will simply write a single 128k block that indicates where the latest versions of blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, and 99 can be found.
This technique tends to work very well. However, over time, the table will get terribly fragmented, and depending on whether the internal block sector size is 512 or 4k (or something in between), there could be a situation where all but one or two of the internal sectors in an erase block have been mapped away to other erase blocks, leading to fragmentation of the erase blocks. This is not just a theoretical problem; there are reports from the field that this happens relatively easily. For example, see Allyn Malventano’s Long-term performance analysis of Intel Mainstream SSDs and Marc Prieur’s report from BeHardware.com, which includes an official response from Intel regarding this phenomenon. Laurent Gilson posted on the Linux-Thinkpad mailing list that when he tried using the X25-M to store commit journals for a database, after writing 170% of the capacity of the SSD, the small writes caused the write performance to go through the floor. More troubling, Allyn Malventano indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and even a series of large writes apparently does not restore the drive’s function — only an ATA SECURITY ERASE command to completely reset the mapping table seems to help.
So, what can be done to prevent this? Allyn’s review speculates that aligning writes to erase block boundaries can help — I’m not 100% sure this is true, but without detailed knowledge of what is going on under the covers in Intel’s SSD, we won’t know for sure. It certainly can’t hurt, though, and there is a distinct possibility that the internal sector size is larger than 512 bytes, which means the default partitioning scheme of 255 heads/63 sectors is probably not a good idea. (Even Vista has moved to a 240/63 scheme, which gives you 8k alignment of partitions; I prefer 224/56 partitioning, since the days when BIOS’s used C/H/S I/O are long gone.)
The ext3 and ext4 file systems tend to defer meta-data writes by pinning them until a transaction commit; this definitely helps, and ext4 allows you to tell the file system about the erase block size (via the stripe width), which should also be helpful. Enabling laptop mode will discourage writing to the disk except in large blocks, which probably helps significantly as well. And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write out an erase block even if it isn’t completely filled. Beyond that, clearly some experimentation will be needed. My current thinking is to use a standard file system aging workload, and then perform an I/O benchmark to see if there has been any performance degradation. I can then vary various file system tuning parameters and algorithms, and confirm whether or not a heavy fsync workload makes the performance worse.
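For reference, laptop mode and the writeback timers are controlled via sysctl; the values below are only an illustration (ten-minute writeback intervals, along the lines one of the commenters below describes), not a recommendation:

# sysctl -w vm.laptop_mode=5
# sysctl -w vm.dirty_writeback_centisecs=60000
# sysctl -w vm.dirty_expire_centisecs=60000

The obvious tradeoff is that up to ten minutes of dirty data can be lost if the system crashes or loses power.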
In the long term, hopefully Intel will release a firmware update which adds support for the ATA TRIM/DISCARD commands, which will allow the file system to inform the SSD that various blocks have been deleted and no longer need to be preserved by the SSD. I suspect this will be a big help: if the SSD knows that certain sectors no longer need to be preserved, it can avoid copying them when trying to defragment the SSD. Given how expensive the X25-M SSD’s are, I hope that there will be a firmware update to support this, and that Intel won’t leave its early adopters high and dry by only offering this functionality in newer models of the SSD. If they were to do that, it would leave many of these early adopters, especially your humble writer (who paid for his SSD out of his own pocket), quite grumpy indeed. Hopefully, though, it won’t come to that.
Update: I’ve since penned a follow-up post “Should Filesystems Be Optimized for SSD’s?”
February 20th, 2009 at 11:16 am
In the end, wouldn’t it just be better if SSDs were plain raw memory and that filesystems were to do all the intelligent stuff ?
February 20th, 2009 at 11:27 am
Thanks for the detailed analysis of this problem, I just purchased the Intel drive about 24 hours before the report came out. :-/
February 20th, 2009 at 11:40 am
@1: In the end, wouldn’t it just be better if SSDs were plain raw memory and that filesystems were to do all the intelligent stuff ?
Glandium,
Linux does have a flash-cognizant file system, ubifs, if the system has direct access to raw flash. In the PC world, though, backwards compatibility is critically important; the typical PC BIOS doesn’t know how to boot off of raw flash, and Windows (which still has 85% of the market) doesn’t know how to use raw flash. So if you want 80 gigabytes or so of flash in a laptop, today the only way to get it is attached to a SATA interface in a 2.5″ laptop disk form factor. Which means you don’t have direct access to raw flash…
It may be that first generation SSD’s (which are cheaper) paired with more intelligent file systems would make sense, but given the huge number of Windows machines, it’s not clear how long 1st generation SSD’s will be available in the market. So it makes sense that we add more smarts into the file system while assuming a certain amount of intelligence in the SSD’s. Remember, those of us who are Linux file system developers have relatively little ability to influence what Intel and other SSD vendors will be making available to the market. We can try making some suggestions, but at the end of the day it’s probably more useful for us to adapt to what they are doing — assuming we can figure it out, or we can get low-level implementation details out of them, along with warnings about which aspects of their internal implementation are likely to be around in future versions of their SSD’s, and which are initial implementation bugs that will likely disappear in future firmware updates or in future versions of their products.
February 20th, 2009 at 12:59 pm
> Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track.
Are partition tables still not using LBAs?
February 20th, 2009 at 1:07 pm
@4: Are partition tables still not using LBAs?
Olaf,
The MS-DOS partition table stores partition boundaries two ways; one way using a starting LBA and the partition length, and another way as the starting and ending cylinder/head/sector values. If the partition is located at an LBA beyond what can be addressed by C/H/S values, a placeholder value (basically the largest possible C/H/S value) is stored in those fields in the partition table.
As far as I know, no BIOS or OS has paid attention to the C/H/S values for about 10 years — however, most partition table programs (including all of Linux’s partition editors) will by default create partitions so they begin and end on cylinder boundaries. All of Linux’s partition editors, as far as I know, will also complain vociferously if the LBA values do not match up with the expected C/H/S fields.
So the real problem here is that thanks to backwards compatibility, a really broken partition table scheme that was first used 30 years ago is still enforcing legacy constraints that became obsolete at least 10+ years ago. And given that no one has bothered to update Linux’s partition table editors, we are using a partition layout which is decidedly non-optimal for next generation storage devices (i.e., 4k sector disks and SSD’s), and which doesn’t work all that well even for some current-generation storage techniques (i.e., RAID).
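If you want to see what your partition editor actually did, the LBA values are what matter; a quick way to check (the device name is just an example):

# fdisk -lu /dev/sdb

The -u flag makes fdisk list the start and end of each partition in 512-byte sectors, so you can verify that the start sectors are multiples of 256 (i.e., 128k), or at least of 8 (4k).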
February 20th, 2009 at 2:02 pm
That shouldn’t be a problem, the FS could insert some padding to align itself.
February 20th, 2009 at 2:16 pm
Olaf: technically, filesystems don’t have a clue where they are on the disk… inserting some padding would need to be manual, like Ted did with LVM. Still, it’s better to fix the partition-making tools.
February 20th, 2009 at 2:19 pm
Alignment issues aside, many people suggest using ext2 on SSDs, rather than ext3 or (presumably) ext4. I suspect that conventional wisdom may not be entirely right. It would be good to get a definitive statement…
February 20th, 2009 at 2:29 pm
> technically, filesystems don’t have a clue where they are on the disk…
I’m sure that could be changed.
> inserting some padding would need to be manual, like Ted did with LVM. Still better to fix the partitions making tools.
Wouldn’t something like GPT be fine?
February 20th, 2009 at 3:41 pm
@8: It would be good to get a definitive statement…
Joey,
For first generation SSD’s, the overhead of the journal is going to hurt the SSD performance, no question. So the conventional wisdom to use ext2 made sense for those devices — although the problem with using ext2 is that if the system crashes, the filesystem may not be consistent and running e2fsck, while it usually works, is not absolutely guaranteed to bring the system back to working order.
For next-generation SSD’s (of which Intel’s X25-M is the only one available on the market, as far as I know), the flash drive effectively has its own technology for minimizing the write amplification effects, which significantly mitigates the performance and wear issues of a large number of random writes. For example, using iostat, I can measure the amount of writes to my hard drive on a daily basis, and with an e-mail/kernel compilation workload, I’m averaging about 9GB/day, using ext4 and a journal. The X25-M is rated at 100GB/day for five years, so I’m not worried about wearing out the X25-M because of the effects of the journalling. So the X25-M significantly reduces the costs of journalling on an SSD, and journalling has enough benefits that I’m quite willing to pay that cost.
After all, ext2 is definitely faster than ext3, but most people are willing to use ext3 on HDD’s because the benefits outweigh the costs. With next-generation SSD’s the same is true — and with ext4, there are features such as extents, delayed allocation, and RAID stripe-alignment that should further benefit performance on SSD’s. (And that’s before I start looking at further ways to boost SSD performance by making ext4 more SSD aware.)
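For anyone who wants to repeat that iostat measurement, a rough sketch (sysstat’s iostat; the counters are cumulative since boot):

# iostat -k

Look at the kB_wrtn column for the drive in question and divide by the days of uptime; it is only an approximation, but it is good enough to compare against the drive’s rated daily write volume.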
February 20th, 2009 at 4:50 pm
Stop using MS-DOS partition tables.
Use GPT.
Most (if not all) the major distros include GPT support in their GRUB1/2.
The alignment problem is much easier to solve once you’re in GPT land – and it is a godsend on RAID already (where aggregate volumes can easily exceed 2TB).
February 20th, 2009 at 5:26 pm
@11: Robin,
For Linux-only systems where you don’t need to worry about dual-boot, and where you are willing to put the boot loader in the MBR, I agree, GPT is probably a good solution. Pity distributions aren’t strongly encouraging GPT in such situations. I don’t know why; if I had to guess, it’s probably due to a combinatoric issue as far as testing is concerned.
February 21st, 2009 at 11:38 am
Very interesting post, glad it made it to /. for me to find.
What are your thoughts on netbooks running SSD drives? (I have an Eee 900A running Easy Peasy.) If I go and reformat and work on this issue, will I see a significant performance increase? My system included a 4G SSD, but there is an aftermarket 16G drive I was looking to upgrade to. I haven’t done any research, but it sounds like it would still be the “first generation” you talk about.
Thanks for your time Ted!
February 21st, 2009 at 11:55 am
@13: Nik,
The 4 gig and 16 gig flash drives may not have even used a SATA interface; they may have used an ATA-style interface much like an SD card. (There are plenty of 4 gig SD cards, and that’s what tends to be used in devices such as cell phones and the Nokia N800, for example.) Those may have bandwidth limitations, even when reading from the card, imposed by the ATA interface. The generation numbering of different designs is somewhat subject to debate, but some might consider ATA-based flash devices “generation 0″.
February 21st, 2009 at 12:10 pm
http://www.ibm.com/developerworks/linux/library/l-flash-filesystems/
I think SS drives would pretty much work just like good ol’ CF cards, which is what this is really for.
February 21st, 2009 at 12:21 pm
@15: srart,
Neither CF cards nor SSD drives can use the flash filesystems described in the IBM developerWorks article which you referenced. CF cards, SD cards, and SSD’s have a flash translation layer which makes them look like block devices.
Flash filesystems such as JFFS2 and UBIFS need access to the raw flash, via the MTD interface. See this description of UBIFS for more details.
February 21st, 2009 at 1:43 pm
Intel should hire a kernel hacker to do that kind of work.
And you should get the SSD for free.
February 21st, 2009 at 2:07 pm
They use a mini PCIe SATA drive in the Eee PCs (at least the 900A series that I bought).
Here’s one of the drives I was thinking of upgrading to (32g SSD)
http://www.newegg.com/Product/Product.aspx?Item=N82E16820609406
February 21st, 2009 at 4:14 pm
Is the X25-E easier to optimize against than the X25-M?
February 21st, 2009 at 4:56 pm
@12:tytso,
I admit I haven’t used Windows in years, but I’m pretty certain at least XP has the ability to use GPT as it’s also in the windows logical disk manager app. GPT is default on Intel-based Macs, so what OS does that leave?
February 21st, 2009 at 5:30 pm
I’ve been using 256/32 for the disk geometry on my SSDs, yielding a cylinder size of 4MB. Given that partitions/filesystems are naturally aligned on cylinder boundaries, this also gives you perfect alignment with any of the erase block sizes in common use today. I started with an OCZ CoreV2 last September and posted this partition alignment info on the OCZ forum but they seem to have censored the post in their recent reorg, oh well.
http://www.ocztechnologyforum.com/forum/search.php?do=finduser&u=35627
February 21st, 2009 at 5:59 pm
@21: I admit I haven’t used Windows in years, but I’m pretty certain at least XP has the ability to use GPT as it’s also in the windows logical disk manager app.
From Microsoft’s Windows and GPT FAQ:
Given that most x86 and x86_64 machines are still not using EFI-compliant firmware, even Windows Vista will not boot off of GPT partitioned disks.
February 21st, 2009 at 6:01 pm
@22: I’ve been using 256/32 for the disk geometry on my SSDs, yielding a cylinder size of 4MB.
Howard,
Is 256 heads valid? The number of heads field is an 8-bit quantity, so I assume you’re filling in 0 and hoping implementations interpret that as 256. I’m not sure that’s guaranteed to work for all BIOS’s and all operating systems….
February 21st, 2009 at 6:20 pm
fdisk has always allowed 1-256 here; I don’t know what other BIOSs will do with that but Windows has no problem booting on this disk, and obviously Linux has no trouble either. Trying to dig up the ATA specs, will follow up when I find a definitive answer.
And to summarize some other stuff I’ve written about adapting to these SSDs – I use laptop_mode all the time, with dirty_writeback and dirty_expire set to 10 minutes each, to minimize the occurrence of random writes. Probably most people wouldn’t feel too comfortable with potentially losing up to 10 minutes of work in a system crash, but kernel crashes pretty much never happen, and I’ve got ECC RAM for taking care of transient failures, and a UPS for power issues. It would be worthwhile to revisit the Sony MiniDisc filesystem, where all metadata was cached in RAM during the entire time the filesystem was mounted, and only re-written at dismount time. That eliminated a lot of the seeking/random writes that occur in regular fs usage (especially since the MiniDisc mainly held digital audio tracks – eliminate the metadata and all you have to do is sequential reads and writes).
February 21st, 2009 at 6:30 pm
There’s a nice description of the BIOS vs Interface limits tabulated here
http://www.allensmith.net/Storage/HDDlimit/Address.htm
February 21st, 2009 at 6:54 pm
So, when you say the first track is taken for MS-DOS compatibility, and that you created a 1GB partition, does that mean that the first partition is missing the first track then? Is that why fdisk shows the first partition with a + (plus sign) behind the blocks? What I want to make sure about is that fdisk (or any disk partitioning program) takes that track into account and allocates the first partition minus the one track. Otherwise, it seems like all the partitions could be off by one track. I assume from the way you talk that that’s what happens, but wanted to confirm. Thanks.
February 21st, 2009 at 8:05 pm
I think you might want to double check your LVM configurations. 250 * 1024 ~ 256 * 1000
It may be that pvcreate is multiplying by 1024 when you specify ‘k’ and pvdisplay is dividing by 1000 (for k, and 10^9 for G). So while you might think you’re on a 128 * 1024 byte boundary, you’re really not.
Recent versions of parted will allow you to specify values in units of sectors, or explicit kiB and MiB if you want multiples of 1024 and not 1000. parted will also happily create msdos partitions at any LBA on the disk and generate the correct CHS values to match. And it doesn’t complain about cylinder boundaries either.
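For example (a sketch; the device name and sizes are only illustrative):

# parted /dev/sdb mklabel msdos
# parted /dev/sdb mkpart primary ext4 1MiB 1GiB
# parted /dev/sdb unit s print

A 1MiB start is a multiple of 128k (and of 4k), and the last command prints the resulting layout in sectors so the alignment can be double-checked.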
February 21st, 2009 at 8:40 pm
Hmm, looking at this and other such stuff, it seems that at some point we need to let go of the legacy and start something afresh. It seems there is much potential for SSD which will never be seen whilst it is limited in these ways.
February 21st, 2009 at 10:39 pm
I like the Alice’s Restaurant joke there.
February 22nd, 2009 at 12:13 am
@28: I think you might want to double check your LVM configurations. 250 * 1024 ~ 256 * 1000. It may be that pvcreate is multiplying by 1024 when you specify ‘k’ and pvdisplay is dividing by 1000 (for k and 10^9 for G) So while you might think you’re on a 128 * 1024 sector boundry you’re really not.
No, that’s not what is going on. Quite aside from the strangeness of pvcreate and pvdisplay using different meanings for ‘k’ even though they are from the same toolset, I know that’s not true for two reasons. First of all, it doesn’t explain why, if pvcreate is given a --metadatasize of 256k, pvdisplay will show a starting extent of 320k. (256 * 1024 is nowhere near 320 * 1000.) Secondly, I can use dmsetup to look at the device-mapper translation table, and I can see that using pvcreate --metadatasize=250k really does cause things to align very nicely on a 128k erase block boundary — where k is 1024.
February 22nd, 2009 at 7:37 am
With regard to an EeePC 900’s SSDs, Ted was correct in his guess about how they are connected – it is via (P)ATA not SATA. Here’s a small dmesg abstract:
ata2.00: CFA: ASUS-PHISON OB SSD, TST2.04P, max UDMA/66
ata2.00: 7880544 sectors, multi 0: LBA
ata2.01: CFA: ASUS-PHISON SSD, TST2.04P, max UDMA/66
ata2.01: 31522176 sectors, multi 0: LBA
ata2.00: configured for UDMA/66
ata2.01: configured for UDMA/66
I believe the main reasons why ext2 is recommended on such “gen 0″ devices are to try to reduce the impact of fsync (which is truly crippling given the devices’ low write speed), coupled with their poor wear levelling (see http://valhenson.livejournal.com/25228.html for details), which causes them to go bad quickly.
It seems to be an open question as to how best to handle such devices (see http://lkml.org/lkml/2009/2/19/166 ) and it would be interesting to know whether alignment and setting the rotational flag has an impact on such devices (I guess this goes for commodity USB flash) or whether there is anything that can be done by filesystems to reduce the (mainly slow write induced) pain.
February 22nd, 2009 at 8:04 am
You seem to be taking a different perspective to Linus on the “adapting to the disk technology” front (Linus seems to be against having the OS know about disk boundaries and having to do levelling itself: http://tinyurl.com/8wpkfc ).
However do these changes make a difference today and if so should distros be making changes now or can they wait a bit and build in changes later?
February 22nd, 2009 at 10:07 am
Have you tried EasyCo’s MFT (“Managed Flash Technology”)? It should improve small write IOPS tremendously. Commercial, and not open source though.
February 23rd, 2009 at 6:39 am
What’s the net performance impact of cleverly and consciously arranging perfect block boundary alignment compared to simply not caring about any of this at all?
February 23rd, 2009 at 7:05 am
I’ve read quite a bit about aligning the last few weeks. What I cannot seem to find an answer to is: is the requirement for aligning still here if one does not use partitions at all?
Situation: I boot from a 2GB ATA FLASH disk which includes an initramfs which in turn uses scripts to assemble a single RAID5 array on whole, unpartitioned disks.
With kind regards, Sander
February 23rd, 2009 at 9:50 am
@38: What’s the net performance impact of cleverly and consciously arranging perfect block boundary alignment compared to simply not caring about any of this at all?
Gunnar, it really depends on the SSD. For some SSD’s, this can make a huge difference. One report on the OCZ forums indicated that for an older version of the OCZ’s SSD, making sure the filesystem was properly aligned resulted in a 300% performance improvement, even though the person who did this was only expecting 10-15% improvement. This was on a Windows system, and for a much older, much more primitive SSD.
The claim by Intel is that erase block alignment shouldn’t matter on the X25-M SSD, since it is “better by design”. At least one reviewer is speculating that it still might be a good thing to do, however. Since I don’t have a spare crash-and-burn X25-M for doing experiments, I haven’t done these experiments for myself yet, so I can’t speak from personal experience on the X25-M. However, forcing a 128k alignment surely can’t hurt performance on the X25-M, and it might help.
February 23rd, 2009 at 10:01 am
@39: Is the requirement for aligning still here if one does not use partitions at all?
Sander,
The alignment requirements don’t go away; but if you don’t use partitions, then you don’t have to worry about misalignments caused by the partition table. So if you use the whole disk, then whatever you place on /dev/sdb will automatically be aligned by definition.
However, if you use LVM on /dev/sdb, you have to worry about misalignments caused by the LVM layer. If you use software RAID using MD to stitch together /dev/sdb, /dev/sdc, /dev/sdd, and /dev/sde to create /dev/md0, then that’s all well and good, but now if you use LVM to break /dev/md0 into individual smaller logical volumes, you have to worry about misalignments caused by the LVM layer again.
It’s important to think of the storage subsystem as a stack of interacting components. Depending on how you build up your storage stack, you might have partition tables, layered on top of MD, layered on top of LVM, etc. Each layer can potentially add alignment problems, and can potentially impose new alignment requirements. For example, if you use 5 flash drives with 128k erase block boundaries, and create a RAID5 device from those 5 flash drives, then you now have a stripe width of 512k (four data chunks of 128k each, plus one parity chunk), so anything layered on top of the RAID5 device should ideally be aligned to a 512k boundary.
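As a concrete sketch of lining up such a stack (device names and sizes are just examples, the chunk size is assumed to be 128k, and the metadatasize value is only a starting guess that should be checked exactly as described in the post):

# mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=128 /dev/sd[bcdef]
# pvcreate --metadatasize 500k /dev/md0
# pvs /dev/md0 -o+pe_start

The goal is a 1st PE that lands on a 512k boundary (the data stripe width of a 5-drive RAID5 with 128k chunks); if pe_start doesn’t come out to a multiple of 512k, adjust the metadatasize until it does.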
P.S. If you’re rich enough to create a large RAID array using flash drives, can you slip me a few spare SSD’s? Clearly you’re flush enough with cash that you can afford it. kthxbye.
February 23rd, 2009 at 11:00 am
I am new to the notion of optimizing hard drives and file systems, but read the entire post and comments and find the entire idea fascinating. LBA, GPT, EXT2/3/4 are all familiar terms but you’re putting them together with block sizes and hard drive firmware. Can any of you recommend a place to go to get up to speed on all of this?
Thanks in advance for any help,
Chris
February 23rd, 2009 at 12:53 pm
> It seems to be an open question as to how best to handle such devices
> [the flash disk in Asus EeePC]
Mount the disk read-only and write to tmpfs:
http://www.staldal.nu/tech/2008/07/26/linux-with-mounted-read-only/
February 24th, 2009 at 1:20 am
I bought a MTRON 1.8″ SSD for my VAIO UX.
Isn’t it simpler to do: FDISK -u /dev/hda
Now FDISK will talk sectors, so you can set a 64 (or whatever you like) sector offset.
It took me quite a while to figure that out. Why is it that people recommend to mess with C/H/S?
February 24th, 2009 at 1:37 am
@44: Isn’t it simpler to do: FDISK -u /dev/hda
You can do that, but it gets annoying for subsequent partitions (I’m not that great at recognizing arbitrary multiples of 1024 in my head), and fdisk will kvetch at you if a partition doesn’t end on a cylinder boundary. It should be safe to ignore the warnings, but I find it more convenient to specify a C/H/S geometry that causes fdisk to work with me, instead of having to specify everything manually and having fdisk complain every step of the way.
I suppose if you are only dropping a single partition on the disk, using the ‘u’ command and simply specifying a starting sector of 256 or 512 (you don’t need 1024 unless you really want a 512k alignment just out of paranoia), and accepting the default “end of disk” as the last sector is simpler. Normally though I’m creating more than just the singleton partition on the disk; if I only need the single partition, I’d probably just create a filesystem using /dev/sdb directly, and dispense with the partition table altogether.
February 24th, 2009 at 2:24 am
> It turns out this is much more difficult than you might first think — most of
> Linux’s storage stack is not set up well to worry about alignment of
> partitions and logical volumes.
By this do you meant the user-space tools on linux? Are there any place in the linux kernel where geometry of 255 heads and 63 sectors is assumed/used?
February 24th, 2009 at 5:45 am
@41,
Dear Ted, thank you for your detailed response. Couldn’t be more clear.
If one only goes the MD software raid route and assumes 128k erase block boundaries, should the array be created with a chunk size of 128? (mdadm -C -c128 -l5 -n..)
Btw, I do not have a flash drive at all (save for the 2GB ATA), but I need to replace an array of rapidly dying Raptors and am looking at SSD for that. They seem to go down in price by the month.
With kind regards, Sander
February 24th, 2009 at 8:50 am
@46: By this do you meant the user-space tools on linux? Are there any place in the linux kernel where geometry of 255 heads and 63 sectors is assumed/used?
Nikanth,
The kernel has an ioctl, HDIO_GETGEO, which returns the geometry of the disk to userspace. It is used as a default by a number of partitioning programs. For most block and scsi drivers, the geometry which is returned is totally fictional, but they tend to use the 255 heads / 63 sectors as the basis for their fiction.
If you do a “git grep getgeo” in the Linux source tree, you’ll find that it’s all over the place. Fixing a few places in the SCSI, ATA, and USB stacks would probably get the vast majority of the commonly used devices today, though.
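You can see the fictional geometry a driver reports with, for example:

# hdparm -g /dev/sdb

which prints the cylinders/heads/sectors figures the kernel has made up for the device.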
February 24th, 2009 at 8:22 pm
translated to russian
http://orangeudav.ru/2009/02/25/perevod-aligning-filesystems-to-an-ssd’s-erase-block-size/
February 25th, 2009 at 12:41 pm
Did anybody notice the following paragraph from the data sheet (page 11) of the X25-M:
> 3.5.4 Minimum Useful Life
> A typical client usage of 20 GB writes per day is assumed. Should the host system
> attempt to exceed 20 GB writes per day by a large margin for an extended period, the
> drive will enable the endurance management feature to adjust write performance. By
> efficiently managing performance, this feature enables the device to have, at a
> minimum, a five year useful life. Under normal operation conditions, the drive will not
> invoke this feature.
That’s like BMW writing in their operations manual that the car will ensure its 10-year lifetime by enforcing a low maximum speed if you try to drive too long a distance per day.
Actually, this makes this disk unusable for any applications that require constant performance.
Maybe that’s intended to get people to buy more expensive SSDs for servers.
February 25th, 2009 at 1:12 pm
@52: Peter,
Yeah, that is interesting. An interesting question is exactly what the “endurance management feature” actually does. The implication is that the SSD will trade off something (performance? data integrity guarantees?) in exchange for making sure the disk will last its 5 year useful life. But most of the things that you would want to do in order to reduce write amplification also improve performance, so you would want to do them all of the time anyway.
The only tradeoff I can imagine the SSD making is to stop honoring CACHE FLUSH requests, and trade off data integrity for lifetime. But that’s a very scary thing to do, and would probably open Intel up to any number of liability claims. And I can’t quite imagine Intel being that foolhardy.
February 25th, 2009 at 2:29 pm
@tytso: I think the section in the data sheet states quite clearly that the SSD will actually _slow_down_ writes voluntarily once the limits are hit. I guess it will just wait N milliseconds after each write before accepting the next write request to enforce the upper limit on data written per day.
February 25th, 2009 at 2:43 pm
BTW: What I wonder about is why it seems to be so difficult to produce NAND-flash at a reasonable price that has erase blocks no larger than 512 byte. I understand that the wiring of NAND-flashes makes them consume much less die space than NOR-flashes, but does the compromise on the size of erase blocks have to be THAT BIG? Wouldn’t erase blocks of size 512 bytes still allow to produce NAND flashes that are far cheaper than NOR-flashes?
This would immediately solve the filesystem-performance issue.
Well, maybe that is what the FusionIO drives use, at least they are much faster with small random writes than anything else I’ve seen. But then again, they _are_ ridiculously expensive…
February 25th, 2009 at 4:26 pm
@54: I guess it will just wait N milliseconds after each write before accepting the next write request to enforce the upper limit on data written per day.
You may be right; that may be all Intel is doing. It would be highly disappointing if that were true, though.
@55: Well, maybe that is what the FusionIO drives use, at least they are much faster with small random writes than anything else I’ve seen. But then again, they _are_ ridiculously expensive…
My understanding is that their proprietary Linux driver contains something which is basically some kind of log-structured file system; if that is true, they are doing basically what Intel did with the X25-M, except they are doing it in software instead of hardware. The rest of their raw write bandwidth comes from writing to a large number of flash devices in parallel (i.e., the X25-M and X25-E write to 10 flash devices in parallel). My understanding is that Fusion IO is using SLC, like the X25-E, and is using a much larger level of parallelism, and then using a direct PCIe bus attachment to avoid getting bottlenecked by the SATA interface. At least, that’s what I’ve been told; please don’t take this as gospel without doing some independent verification…
February 26th, 2009 at 2:06 am
Ideally, shouldn’t the partition boundary and the file-system block-size be based on the hw_sector_size? Shouldn’t the use of CHS be deprecated? Also for SSDs erase-block size should be used as the hw_sector_size for the device.
> If you do a “git grep getgeo” in the Linux source tree, you’ll find that
> it’s all over the place. Fixing a few places in the SCSI, ATA, and USB
> stacks would probably get the vast majority of the commonly used
> devices today, though.
If the drivers do not know the geometry shouldn’t they return error instead of a fictional value? These devices do not even have a CHS geometry!
February 26th, 2009 at 7:22 am
@56: The interesting thing about the FusionIO SSDs is that they perform well with small scattered writes (not as well as with sequential writes, of course), so if the proprietary driver does implement a log-structured filesystem, it’s a nice one, though ours haven’t aged enough yet for us to be entirely sure about that.
To me, the non-existence of open-source drivers is a no-go for broader usage, though. We have two 320GB ioDrives for testing, and they do use MLC NAND flashes (only the smaller drives < 320GB use SLC flashes).
February 28th, 2009 at 10:32 am
My Debian unstable pvcreate (lvm) version 2.02.44, which is right now the latest, doesn’t take --metadatasize in any way, shape or form, whether placed in front of or behind the destination partition.
But luckily, running ‘lvm’ and then entering ‘pvcreate --metadatasize 250k /dev/sdX’ works as expected.
Good luck to everyone
March 2nd, 2009 at 7:58 pm
In debian lenny pvcreate --metadatasize=250k works fine.
And a question…
Does anyone know how LUKS dm-crypt affects things? I get a performance drop when using dm-crypt, and even more if I put LVM on top of it.
I have tried cryptsetup --align-payload=256 but I’m not sure it made any difference (trying to align on a 128k boundary).
March 25th, 2009 at 11:49 am
“My current thinking is to use a standard file system aging workload”
I’m curious … what workloads ARE considered standard file system aging workloads?
March 25th, 2009 at 4:04 pm
ok second try, comment didn’t show up, sorry if this is duplicate:
Does anyone know if it is possible to run Linux with no partitions? To format, for example, /dev/sda without partitioning it first, and put /boot and everything else on that one formatted filesystem?
Thanks.
March 26th, 2009 at 6:17 am
Yes.
Just mkfs /dev/sda and mount /dev/sda /
March 26th, 2009 at 1:27 pm
Hi there, you are right about the block size. I don’t know anything about erasing a block, but fewer blocks shortens the life of the SSD or flash memory, so SSDs and flash memory now use 1 MB to 4 MB blocks, not the other way around. And besides, SSDs and flash memory can last over 150 years at 24/7 operation; future SSDs and flash memory may have infinite read/write.
March 27th, 2009 at 11:35 pm
Hi there, you are right, the average SSD flash blocks are now around 256 KB, while some go as high as 1 to 4 MB blocks.
April 1st, 2009 at 12:24 pm
Well, after reading through the whole thread I still have a question/uncertainty. Would use of GPT align all partitions with the erase-block size? Would appreciate it if somebody could clarify.
April 14th, 2009 at 7:05 am
It will be interesting to see if the new firmware intel has just released for these devices addresses any of the issues that you mention.
http://www.engadget.com/2009/04/13/intel-issues-firmware-update-for-ailing-x18-m-and-x25-m-ssds/
April 16th, 2009 at 3:47 pm
While aligning the partition to the erase block size is important, we also need to consider the flash page size. On a given chip the page size is typically 2KB. On an SSD with 8-10 chips operating in parallel that makes the smallest writable unit 16-20KB. Ideally you would set your filesystem pagesize to match this size, but chips like x86 only let you use 4K or 2/4MB page sizes, which isn’t really useful. So the alternative is to treat an SSD like a RAID array with a stripe size of 16KB or so. I.e., you want to force the host FS cache to do as much work as possible in coalescing I/Os before issuing them to the drive.
April 20th, 2009 at 1:07 am
@39 and @41
Am I reading this correctly to say that if I don’t set up any partitions on my SSD, and do a mkfs right to /dev/sda, I will not have an alignment problem?
I’m having a hard time picturing a fs without a partition though. I can’t seem to find too many resources on the web that discuss the repercussions of this.
April 28th, 2009 at 8:54 am
@71: Am I reading this correctly to say that if I don’t set up any partitions on my SSD, and do a mkfs right to /dev/sda, I will not have an alignment problem?
Yes, that’s correct. There’s nothing wrong with having a filesystem on a whole-disk, without a partition table. Linux will handle it just fine. It means you don’t get any of the benefits of partitioning, but if the SSD is relatively small to start with, and you were only going to have one partition on the disk anyway, using a whole-disk partition might be your best bet.
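In other words, something like this is perfectly legitimate (a sketch; the device name and stripe width are just examples):

# mke2fs -t ext4 -E stripe-width=32 /dev/sdb
# mount /dev/sdb /mnt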
April 28th, 2009 at 11:19 am
does any major linux distro have a howto for making a partitionless install?
I guess it could be done with LFS or Arch linux / gentoo, but I have been using ubuntu…
if anyone has a howto link for arch / ubuntu / debian please post it.
Thanks.
May 7th, 2009 at 4:12 am
This is only tangentially related but MS have been posting what steps Windows 7 is taking to accommodate SSDs: http://blogs.msdn.com/e7/archive/2009/05/05/support-and-q-a-for-solid-state-drives-and.aspx . Interestingly it looks like they are using heuristics to identify them…
May 8th, 2009 at 3:56 pm
Hi,
I was wondering if LVM is actually needed to align in the case of the simple setup:
ext3 /boot
ext4 everything-else
? It seems like LVM doesn’t actually help alignment there.
Also, a recent X25-M firmware update added TRIM support. Is there a way to enable this in recent kernels?
May 10th, 2009 at 4:08 pm
Hi Ted, thanks for this information. I have a new system with a latest firmware X25-M ready to take an Ubuntu install. I’d like to arrange the disk alignment before I proceed but some of this is a little unclear to me.
If I go with 224 heads and 56 sectors like you say, I don’t understand the significance of 32×7 and 8×7 in your text. Can you explain so I understand ?
Which is best 224/56 or 256/32 as suggested by Howard Chu? Can you explain why, just so I understand.
What would be the fdisk commands to set up /dev/sdb1 as /boot with 1Gb and the rest as /dev/sdb2 for lvm as in your example, and what does 1Gb mean here (is it 1024×1Mb or 1000×1Mb) ? Do you just use +1Gb in fdisk or do you calculate the number of cylinders. The whole heads/sectors/cylinders etc throws me off a little and the uncertainty around what 1Mb and 1Gb are does not help
If I wanted to have /boot as 500Mb or 250Mb (assuming 1Gb=1000Mb as it seems to be in fdisk) how would that change the commands ?
If I decided to put boot on another disk and dedicate all of /dev/sdb to LVM how do I do that (If I am not partitioning the disc then how do I specify heads/sectors, etc).
I found your article very interesting and would like to optimise my new drive so I would appreciate some help understanding the finer points.
many thanks
May 11th, 2009 at 6:50 pm
I’ve done a bit more reading and can see that 56 sectors gives 56 x 512 = 28672 bytes, or 7 blocks of 4K, in each track. As partitions start on a cylinder boundary, I can see that specifying 56 sectors will make all partitions start on a 4K boundary.
I don’t see the significance of 224 heads: the geometry works also with 255/56. Why do you choose 224 heads? What is the impact of leaving heads=255?
Also, I’ve seen elsewhere (on OCZ forums) mention of using H=32 and S=32 which gives nice powers of two: 4 blocks of 4K per cylinder. Is there a reason that you don’t do this ?
Apologies for all the “noob” questions but I’d really appreciate your help to get my head around these concepts. I’ve never bothered to think about storage at this level before.
May 24th, 2009 at 8:17 pm
Is the ATA security erase command you’re talking about done with hdparm --security-erase? You wrote it like people should know what you’re talking about, but if you google: “ata security erase” linux, all you get is lots and lots of links to your blog post.
May 24th, 2009 at 8:22 pm
And if so, the manual for that command specifies a PWD argument and talks about it like it’s only supposed to be done after you do a --security-freeze. So do you need to do this? If not, the man page could use updating. Mine says version 8.9 at the bottom.
May 25th, 2009 at 5:12 am
I found this; it helped me find out how to do the SECURITY ERASE:
http://ata.wiki.kernel.org/index.php/ATA_Secure_Erase
May 28th, 2009 at 1:21 am
I was looking through whatever I could find to learn about the benefit there might be from fixing the alignment (especially for FTL-based SSDs), since there were also some reported results showing good performance of the Intel SSDs without alignment (Allyn Malventano’s 2 articles on PC Perspective). I have one question for you (this is a newbie question and might sound stupid).
You have mentioned that creating a filesystem on a whole disk will not create any alignment problems. Can we do this then: using this configuration we test the read/write performance of the SSD and compare it with a partitioned (default scheme) SSD. Will that be a true assessment of the benefit of alignment?
Since this sounds fairly easy to test I was wondering if anybody had tested this and noticed the difference.
June 4th, 2009 at 8:32 am
Hello, thanks for a very good article. I have been searching the web for more info about partition aligning and it looks like you are almost the only one who has written something about it. I have some questions:
1) My SSD (CF card Pretec 233x) uses NAND flash K9LBG08U0M, which has an erase block size of 512KB (kilobytes). This is supposed to be a very common NAND in cheap SSDs. It also uses a dual-channel, interleaved-mode NAND controller, the SM223TF. My first question is this: does this mean that the erase block size is 512KB, or, because of dual-channel interleaved access, is it 1024KB? What about 8-channel OCZ drives which also use this NAND flash? I tried to find a whitepaper for my controller but failed.
2) Let’s suppose my SSD has a 512KB erase block size. I partitioned my drive with fdisk -S 32 -H 32 and created the first partition starting on the second cylinder, so it starts at block 1024 (offset). Then I created the FS with stripe-width=128. Are these numbers correct? Please verify; there are hundreds of posts on the OCZ forum (www.ocztechnologyforum.com) regarding these numbers, along with some benchmarks, but nobody knows what is right. Thanks very much.
June 9th, 2009 at 9:57 pm
After reading all this and playing with the config of my SSD on my Dell Mini 9, I’m now moving on to rebuilding my OpenFiler NAS.
I’m having real problems trying to find resources on RAID partition alignment for linux. I have found a few things with regards to windows and SQL server, but it doesn’t really help.
Can someone point me to some docs on RAID alignment in Linux? Short of that, does anyone want to help me experiment, and I will document it?
July 30th, 2009 at 2:50 pm
Hi there, I heard that erasing blocks is a bad thing; don’t they have a limit of erases before the SSD goes bad?
October 5th, 2009 at 11:18 am
Hi,
Thanks for the fdisk alignment numbers: just used your recipe to lay out a 128GB USB Flash thumbdrive. Now must go back and re-do my SD card! Both for the rather spiffing SheevaPlug…
http://www.earth.org.uk/note-on-SheevaPlug-setup.html#storage
Rgds
Damon
December 6th, 2009 at 8:09 pm
Great writeup. Disk alignment can be important pretty much everywhere these days.
This has brought up many questions on my end as well, but I’ll think of those later. The first thing that I found interesting is that even though the recommendation is to align the FS to the SSD’s erase block size, how would one actually find it?
Some SSD cards specify it, some don’t. I was reading through Transcend’s 43-page PDF and couldn’t find any mention of it. Is it controller based? Firmware based? Based on the number of flash chips? On the flash chip itself? I don’t think it would help trying to align the FS to the erase block if you get it wrong.
Also, I found MFT (mentioned earlier here as well) aligned them to something silly like 30MB.
Anyway, yeah, how does one obtain the EBS of a flash drive?
December 8th, 2009 at 3:21 pm
If I don’t make partitions and do like #71 tommyp said (mkfs right to /dev/sda),
where is the best place to install the grub bootloader? Also on /dev/sda, or better on a second disk, e.g. /dev/sdb?
In other words: is installing grub on /dev/sda bad for the alignment on an SSD?
December 8th, 2009 at 8:30 pm
Hello Ted!
This is a late reply, but as a PC hobbyist (and Linux user) I’ve really enjoyed your posts, in particular the ext4 and SSD-related stuff.
I’ve ordered an Intel G2 solid state drive and it should arrive in a few days, so in the meantime I decided to apply your instructions on a regular hard-drive and dd the result on the SSD once it arrives. However, I won’t be using LVM and this poses some problems. I’ve got two questions:
1) I used cfdisk and manually set ‘ -h 224 -s 56 ‘ but I had to create quite a few “logical” partitions. Doesn’t that screw up the alignment? According to Wikipedia the Extended Boot Record takes up some space…
http://en.wikipedia.org/wiki/Extended_boot_record
2) Is there any method to verify the alignment of the data of a data-filled, arbitrary partition regardless of the partitioning method used (MBR/GPT/LVM etc), at least when using ext3/ext4? (Without having to repartition or calculate by hand!) I’m thinking something like a benchmark-based program but it probably doesn’t exist yet (and it would probably require write access).
Thank you in advance. Greetings from Europe,
Nicholas K.
PS. Keep up your great kernel-hacking job!
December 9th, 2009 at 1:55 pm
Regarding my previous comment and alignment verification in particular (sorry, apparently I’m not thinking clearly, too tired trying to set up PXE boot on a silly machine):
Maybe it would be possible to write a few files starting with a known magic value and then get their position on the disk relative to the start of the file system, using fs-specific commands? Then one would check the disk on a physical level with “dd skip=$(calculated position of fs start + position of files IN the fs, rounded at 128k multiples)” trying to find those magic numbers. But this has several drawbacks: It assumes that the fs is contiguously allocated (no LVM!) and that the SSD itself starts with an aligned (=full) block (a safe assumption, i think) and also still requires some calculations. Also it would be fs-specific. But it could be automated.
Still, trying a few values and observing the impact on performance would probably be easier, although time-consuming. Performance is what we care about after all. However, performance differences could be VERY small, since generation 2 drives tend to be a bit too “smart”. Do you know any reliable and suitable benchmark? (Maybe bonnie++ ?)
The problem with that method is that performance differences could be negligible (and thus not enough to check the alignment), but the impact of misalignment on the drive wearing could still be very significant.
Oh well! I hope that my new, shiny SSD is smart enough by itself! Sorry for the bad English,
Nicholas K.
December 11th, 2009 at 11:53 am
I was actually thinking along the same lines myself #91.
What I was thinking of was, getting the LBA (which is ‘physical’ enough) addresses of certain bits.
E.g. query your partition start’s LBA, say LBA 4, which would be an alignment of 2k (512 * 4) (this being for demonstration purposes). Then, query your md start block somehow; if this returns anything other than LBA 4, you are misaligned. Next query your FS start block (or LVM) and same routine.
Then you know that your starting bits are all on the same blocks and thus physically aligned. Of course all other offsets need to match then, e.g. stride size etc etc.
December 18th, 2009 at 2:11 pm
WD just released the first 4K sector disks, calling it Advanced Format.
tytso: Care to blog about this?
December 19th, 2009 at 2:21 pm
I have a Samsung 128GB SSD in my laptop (model: MMCRE28G8MXP-0VBL1, firmware: VBM1EL1Q). I would like to create my partitions so that they are aligned to erase block boundaries as suggested, but I can’t find any information about this drive, like what the erase block size is. I would also like to know if this SSD (with this firmware version) supports TRIM, and if so, how to use that feature with ext4. Thanks, any help is appreciated.
December 21st, 2009 at 2:48 pm
Ted, has this improved? Storage vendors are apparently saying Linux/OSX/Vista+ have no issues (one assumes no performance loss) with 4k blocks – http://www.anandtech.com/storage/showdoc.aspx?i=3691 .
January 7th, 2010 at 12:45 am
Does this apply to USB pen drives too? I am using LVM partitions and didn’t pay attention to this. I can only use Firefox if I disable the cache. The UI delays I was experiencing were about 1~5 seconds!
January 12th, 2010 at 8:29 pm
Re: #96
Absolutely. USB pen drives have erase blocks as well. Trying to find them is hard though. I tried having no partition on them and just dumping a vfat filesystem on it, but the other OS didn’t agree. Ubuntu/Linux worked just fine with it.
I went ahead and created a partition starting at 1024 (fdisk -u /dev/sdb) which should be okay.
If you are using it Linux-exclusively, or your other OS has no issues with unpartitioned disks, just don’t make a partition at all! If you use ext* you can still specify a stride size; I do think having stride sizes for ext* makes them more efficient, as bitmaps etc. are on erase block sizes as well.
January 18th, 2010 at 6:08 am
SSDs need to get cheaper and we need larger capacities. I won’t be using SSDs anytime soon. Will be sticking to SATA for my personal computers and SCSI for my servers.
May 26th, 2010 at 11:03 am
One question I wondered about: does the swap also do TRIM?
New SSD firmwares tend to level wear all across the SSD memory. Consequently, I assume they don’t care about partition boundaries, in order to level the wear induced by intensive access on a small partition (e.g. swap) onto blocks that would normally belong to a less frequently used partition (e.g. an oversized /home).
So, in order for the wear to be evenly spread, the system should say which blocks in the swap partition are currently busy or not, knowing that, at system halt time, all of them have to be set as unused (that is, TRIMmed). The same should also hold for swap files, but cannot, because at the filesystem level all pages of a swap file have to remain allocated (no “holes” in the file), while most of their contents is irrelevant to the SSD disk and could be reallocated elsewhere without copying the existing contents of the file. In that respect, swap partitions might provide better performance than swap files.
What do you think?
June 28th, 2010 at 6:47 pm
Great article. I am still unclear how you arrived at the -H 224 -S 56 for this SSD. I have read a few suggestions out there to use -H 32 -S 32 for this SSD. Can you comment on these settings and attempt to clarify my question? I have an X25-M (80 G). I would like to have two ext4 partitions: one for / and one for /home. I will use the SSD as /dev/sdb and keep my HDD as /dev/sda. The HDD will store /boot, swap, /var and /media/data.
Thank you!
Reference suggesting the -H 32 -S 32: http://www.ocztechnologyforum.com/forum/showthread.php?54379-Linux-Tips-tweaks-and-alignment&p=373226&viewfull=1#post373226