To be perfectly clear: the 1000x claim isn't about *throughput*, but *latency*. ...

ChuckMcM · on June 14, 2016

Ok, a couple of things. First, I found it really amazing how convolutedly Intel plotted their numbers on that latency graph so that it would go "up and to the right", which everyone "knows" is good.

The second thing is that Intel has been claiming multiple GB/second of throughput as well. They really do believe this will be a replacement for DRAM on some platforms. As in, you read into your L3 cache from this stuff and you flush to it when you write out dirty cache lines. And while that will make the overall system slower, it gives it literally instant stop/start capability if the key parts of your architecture are static (can retain data at 0 clock). What that means is a laptop that can turn itself off between waiting for sectors to read in from the disk or packets to come off the network, or keys to be pressed by the user. Non-illuminated run times in days off of a battery source rather than hours.

Imagine 1.2TB of this stuff on the motherboard substituting for DRAM. So you've got every application and all the data for your applications already "in memory" as far as the chip is concerned. App switching? Instant. App data availability? Instant. Quite a different experience than what we have today.

sspiff · on June 14, 2016

> What that means is a laptop that can turn itself off between waiting for sectors to read in from the disk or packets to come off the network, or keys to be pressed by the user.

Yeah, no. It might be able to turn off the CPU, but everything else would have to keep running. Graphics hardware does not deal well with frequent state changes. Network hardware would still need to run to keep its clock in sync with the signal.

So, you get to turn off a handful of components, which is where we already are today - CPUs spend most of their time in deep sleep states. USB devices and chipsets implement sleep states. Graphics hardware clocks down significantly.

visarga · on June 14, 2016

> Quite a different experience than what we have today.

For example, I only have to wait at most one second for an app to load, if it's not already open and minimized. It's comparable to the time it takes me to click a button. And my computer is rebooted only once every few months, so it's always up.

This improvement will be more meaningful for random access database operations.

simula67 · on June 14, 2016

Is this technology costly ?

EDIT: Does not seem to be.

“You could put the cost somewhere between NAND and DRAM. Cost per bit, it’s likely to be in between them somewhere. But actual cost will result from the products we bring to the marketplace.”

http://hothardware.com/news/intel-and-micron-jointly-drop-di...

slashdev · on June 13, 2016

DRAM does not have 1ns latency! On a 4GHz Skylake chip you can access the L1 cache at 1ns. DRAM is going to be more in the range of 60 (local) to 100 (remote) nanoseconds. Optane will likely be > 300ns latency from what I've heard, if you access it directly through the CPU memory controller via loads and stores, which is still very impressive.

chongli · on June 14, 2016

When we talk about 1ns latency for DRAM we aren't talking about access latency, we're talking about CAS latency[0]. This is how fast the DRAM actually works. It doesn't include the rest of the overhead between the time the data is sent over the bus and the time it's loaded into a register and ready for an instruction to operate on it.

CAS latency for DDR3-3300 is less than 1/3ns, according to the article below. If Intel's new Optane memory can achieve addressing on the order of 10ns, it will be very competitive with DRAM, since all of the other overhead along the data path ought to be similar.

[0] https://en.wikipedia.org/wiki/CAS_latency

vitus · on June 14, 2016

This is actually slightly incorrect. You seem to be calculating as if the clock rate were 3300MHz, even though the listed number in DDR (double data rate) is actually double the actual clock rate (IIRC, for marketing reasons).

On top of that, the CAS latency is given in cycles -- the cited CAS latency for DDR3-3300 is 16 cycles. Multiplying # cycles by time per cycle yields a value closer to 9-10ns. Note that this is the "first word" latency in this table.

The other thing you may be talking about is this "bit time", which looks like it's simply inverse throughput. It's very important not to conflate throughput with latency.
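The cycle arithmetic in this comment can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name `cas_latency_ns` is made up for illustration, and the inputs (3300 MT/s, 16 cycles) are the figures quoted in the comment.

```python
def cas_latency_ns(transfer_rate_mts: float, cas_cycles: int) -> float:
    """First-word CAS latency in nanoseconds for a DDR module."""
    # DDR moves data on both clock edges, so the real clock is half
    # the number printed in the module name.
    clock_mhz = transfer_rate_mts / 2
    cycle_time_ns = 1000.0 / clock_mhz
    return cas_cycles * cycle_time_ns


# Treating 3300 as the clock rate (the mistake being corrected) halves the answer.
naive_ns = 16 * 1000.0 / 3300
correct_ns = cas_latency_ns(3300, 16)

print(f"naive:   {naive_ns:.2f} ns")
print(f"correct: {correct_ns:.2f} ns")
```

The corrected value lands in the 9-10ns range mentioned above, roughly double the naive figure.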

nneonneo · on June 13, 2016

I'm quoting from the graph in the article. The numbers are also meant to be "on the order of", since the graph only provides that level of detail - so, for example, 1ns really means "1-9ns".

I believe that they are using fairly best-case latency numbers, under the assumption that the memory accesses are close to sequential. For random memory accesses, as you note, the latencies are higher. Unfortunately the article doesn't go into detail on what Optane's worst-case latencies are, nor what the latencies will be like in a real functional system (mostly because Intel has only early prototypes to show).

slashdev · on June 14, 2016

Yeah, I wouldn't do that, that graph is hardly accurate. 9ns won't get you an L3 cache hit on most Skylake chips, never mind main memory.

For sequential access giving a latency number doesn't make sense - you need to talk throughput. You can get upwards of 20GB/s with a single core for DRAM. Here you're at least bounded by the 4x PCI interface (more likely the Optane device) so maybe 3GB/s if you're feeling generous.

Koromix · on June 13, 2016

Consumer OS I/O software stacks are completely unable to deal with this. It was fine for spinning drive latency, it's okay(ish) for high-performance SSD but the open/read/write model just cannot work that fast. It's several orders of magnitude off.

Assuming we keep the file system model, I'm guessing some kind of direct memory mapping is in order? Does anyone know what's ahead of us on the software side, to take advantage of this kind of latency?

jacquesm · on June 13, 2016

Files exist because persistent storage is slow, copying data from RAM to disk is slow and we'd like to have a way to refer by name to blobs of bytes, but in the end it is only there because it's a useful metaphor. If the metaphor gets in the way of our ability to use the storage medium, you have three options: a mapping (like the direct memory mapping you refer to), which in this case would almost be like a backwards compatibility layer; a more involved scheme where all RAM is backed by persistent storage and devices only keep a flag for which objects are currently being worked on and which are stable (an 'open' file would simply be marked as currently in use by an application but there would be no explicit save, just an unlock); or a completely new system based on an image-like architecture (see: Smalltalk, some Lisp implementations). There are some drawbacks to that model (such as harder collaborative development) but I see no reason that applications could not be packaged that way even if the development used a more traditional method.

It would be fairly trivial to layer a conventional filesystem on top of such a persistent object store to get the best of both worlds, which would - amongst others - give you instantaneous suspend and re-awaken and other goodies.

If true this will vastly change the way we use computers.

gumby · on June 14, 2016

This was the original architecture of Multics BTW -- the world just consisted of a single address space of segments (pages) plus a capability-based address space; a higher level construct gave you a named structure for a group of segments.

A similar approach has been taken with HP's interesting memristor-based "Machine".

There's a lot of interesting stuff in Organick's book on Multics, not all of which was implemented, unfortunately. And a lot of really good stuff was tossed overboard when fitting Unix into a PDP-7 (a lot of overwrought bad stuff was jettisoned too -- don't get the wrong idea!).

nneonneo · on June 13, 2016

This is a great question. One way I imagine it:

Short term, the disk controller becomes a peripheral on the memory bus. On a 64-bit x86-64 system, the top 16 bits of the address are either 0000 or ffff for RAM. Make it so that the prefix 1000 (for example) maps to the disk, so accessing (physical) address 1000000013371000 accesses byte 13371000 on the disk.
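To make the proposed address split concrete, here is a minimal sketch of the decoding step. The `1000` prefix is the comment's own "for example" value, not a real hardware convention, and the names `decode`, `DISK_PREFIX` and `RAM_PREFIXES` are invented for this illustration.

```python
OFFSET_BITS = 48                      # low 48 bits carry the byte offset
OFFSET_MASK = (1 << OFFSET_BITS) - 1

RAM_PREFIXES = {0x0000, 0xFFFF}       # per the comment: top 16 bits for RAM
DISK_PREFIX = 0x1000                  # hypothetical prefix chosen in the comment


def decode(phys_addr: int):
    """Split a 64-bit address into (target, offset) under the proposed scheme."""
    prefix = phys_addr >> OFFSET_BITS
    offset = phys_addr & OFFSET_MASK
    if prefix == DISK_PREFIX:
        return "disk", offset
    if prefix in RAM_PREFIXES:
        return "ram", offset
    return "unmapped", offset


target, offset = decode(0x1000000013371000)
print(target, hex(offset))
```

With the address from the comment, the decoder routes the access to the disk at byte offset 13371000 (hex).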

Now processes can just ask the OS to perform a physical memory mapping to obtain a range of virtual addresses directly backed by disk pages, with page protections set based on their filesystem permissions. Such physical address mapping interfaces already exist in most OSes to support memory mapped I/O (for example, mapping /dev/mem in Linux).

This addressing scheme has another advantage: other devices on the system can use e.g. DMA to directly talk to the disk without any CPU intervention. For example, the GPU could load textures straight off of disk, just like John Carmack wants.

Medium term, we start rethinking the filesystem. If we make the address range for a given disk completely persistent, we can just put pointers to disk bytes on the disk itself. Processes will use the same virtual addresses as the physical addresses when talking to the disk. Suddenly "serialization" to disk is no longer required: data structures can be stored in native form directly on the disk. Imagine having a "dmalloc" function call hand you a chunk of persistent storage which you treat the same as any memory, but which can outlive the process. Similar concepts exist in some languages (like MUMPS), and now we bring the idea to all programming environments.
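A rough feel for the "dmalloc" idea is possible today with an ordinary memory-mapped file standing in for the persistent device. The sketch below is an assumption-laden stand-in, not the API the comment envisions: `dmalloc` and `bump_counter` are hypothetical names, the backing store is a temp file rather than byte-addressable NVM, and each `bump_counter` call plays the role of a separate process run.

```python
import mmap
import os
import struct
import tempfile


def dmalloc(path: str, size: int) -> mmap.mmap:
    """Hypothetical helper: hand back `size` bytes of memory backed by `path`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)    # freshly extended space reads as zeros
        return mmap.mmap(fd, size)    # shared, read/write mapping by default
    finally:
        os.close(fd)                  # the mapping keeps its own handle


def bump_counter(path: str) -> int:
    """Increment a 64-bit counter stored in native binary form, no serialization."""
    region = dmalloc(path, 8)
    (count,) = struct.unpack_from("<Q", region, 0)
    struct.pack_into("<Q", region, 0, count + 1)
    region.flush()                    # push the dirty page to the backing file
    region.close()
    return count + 1


path = os.path.join(tempfile.mkdtemp(), "counter.bin")
first = bump_counter(path)            # "first run" of the program
second = bump_counter(path)           # "second run" sees the earlier value
print(first, second)
```

The counter survives between calls because the data lives in the mapped region rather than in process memory, which is the property the comment wants from persistent storage in general.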

Long term, RAM ceases to be an independent entity, and merely becomes OS-managed cache for the big persistent storage (assuming it still has any latency/bandwidth advantages by this point). Now you can get rid of the notion of "shutting down" or "starting up" the system: everything is persistent. Without having to constantly refresh DRAM to keep the system alive, devices can "sleep/hibernate" more frequently and readily, saving significant power. Programming models become nearly unrecognizable as old models of memory management and process lifetimes give way to new models of persistent storage management and eternal services.

We're not far off from seeing a potential revolution in computing here.

imtringued · on June 14, 2016

>Suddenly "serialization" to disk is no longer required

That's not true. You'd still want Excel to just save your sheets as an xls file so you can show it to a coworker, as opposed to sharing the entire Excel program state. Upgrading from one version to another also requires a stable persistent data format for things such as configurations.

Not to mention that you still need serialization to communicate over a network.

nneonneo · on June 14, 2016

Yes, you're definitely right - "serialization" as a concept is going to still be needed for data interchange. I just didn't want to mention it because that's just status quo, not part of the new computing model.

In my original comment, I simply meant that you would no longer have to serialize your data to your own disk.

yourapostasy · on June 14, 2016

That's almost a weird amalgam of really old school computer architectures and new.

It seemed to me from the layperson descriptions of how the old tube-based systems were operated, some of the old architectures had no notion of files at all, it was all RAM all the time. I also read stories of when rotating media first started being used, programmers wrote data not sequentially on the media but in an optimized pattern so by the time the cpu pulled a value into RAM the read head was positioned multiple units over, and so data was written out according to the demands of the hardware, and now we're talking about taking the cpu out of most of the loop.

We aren't coming full circle with the new memory architectures we're exploring, but kind of a spiral in a "history doesn't repeat but rhymes" manner by dusting off some old techniques and putting new spins on them, pun not intended.

If always-persistent state becomes popular, then I wonder if that will revive the popularity of image-based environments, a la Smalltalk and Lisp Machines, along with the collaboration challenges those introduced.

benlwalker · on June 13, 2016

http://pmem.io - for byte addressable form factors

http://spdk.io - for block device form factors

cleech · on June 13, 2016

Memory addressable non-volatile storage instead of SSDs that process block storage read/write commands, and file-system improvements to allow direct mmap of file contents from NVM into your process address space without copying through RAM.

gjulianm · on June 13, 2016

I'm curious, why do you say that the open/read/write model can't work? I suppose that current software should be optimized to take advantage of better performance, but I don't see why it is obsolete.

nneonneo · on June 13, 2016

10ns latency means that a file entry/inode/block is available just 10-30 CPU cycles after it is requested. Suddenly the 1000 CPU cycles spent dispatching the system call look very inefficient by comparison.

Obsolete might not be the right term, but there is essentially no way that open/read/write can take proper advantage of this kind of low latency storage. We are going to need a new way to abstract and interface with this persistent storage.
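The cycle counts quoted here follow from multiplying latency by clock frequency. A small sketch, assuming clock speeds of 1-3 GHz (which is what the 10-30 cycle range implies) and taking the 1000-cycle system call cost from the comment as given; `cycles_elapsed` is an illustrative name.

```python
def cycles_elapsed(latency_ns: float, clock_ghz: float) -> float:
    """CPU cycles that pass while waiting `latency_ns` (1 GHz = 1 cycle per ns)."""
    return latency_ns * clock_ghz


STORAGE_LATENCY_NS = 10      # figure from the comment
SYSCALL_CYCLES = 1000        # figure from the comment, not measured here

for ghz in (1.0, 3.0):
    waiting = cycles_elapsed(STORAGE_LATENCY_NS, ghz)
    ratio = SYSCALL_CYCLES / waiting
    print(f"{ghz} GHz: {waiting:.0f} cycles of storage latency, "
          f"syscall overhead is {ratio:.0f}x that")
```

Under these assumptions the system call dispatch costs tens of times more than the storage access it wraps.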

bdonlan · on June 13, 2016

Read/write are certainly a bit slow, but this may well be acceptable for most applications. High performance apps can use mmap, which can be extended to directly map the nonvolatile memory in question.

gpderetta · on June 13, 2016

Or a way to make system calls faster, some sort of vdso with the ability to switch privilege level.

jdub · on June 14, 2016

That's how __vsyscall works now.

gpderetta · on June 14, 2016

__vsyscall simply abstracts away the actual system call invocation strategy (int, sysenter, syscall), but doesn't change the model at all.

Anyway, what I was thinking was similar to call gates, which were phased out for being slow. Probably just making syscall faster in the CPU and reducing the overhead kernel side would be enough.

rubber_duck · on June 13, 2016

What's wrong with plain old mmap ?

Koromix · on June 13, 2016

mmap() tricks you into thinking you can directly access the disk. What really happens when you access the mapped data is that the CPU generates a page fault, the OS takes control, copies the data from disk to physical memory and maps the page in your address space. So essentially it is a convenient way to do read().

The mmap() model is good. The current page fault mechanism, not so much.

rayiner · on June 13, 2016

The difference is that you only take the page fault once per page (or less than once if you do some prefetching in the kernel), while you have to do a system call for every read().
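The contrast between the two access paths can be sketched in Python. This only counts explicit `read()` calls; it does not measure page faults, whose actual number depends on kernel readahead and the page cache. The 512-byte chunk size and the four-page file are arbitrary choices for the illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
data = os.urandom(PAGE * 4)                     # four pages of test data

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

# Path 1: read() in small chunks. Every chunk is a separate system call.
fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
chunks, read_calls = [], 0
while True:
    chunk = os.read(fd, 512)
    read_calls += 1
    if not chunk:                               # empty result signals end of file
        break
    chunks.append(chunk)
os.close(fd)
via_read = b"".join(chunks)

# Path 2: mmap(). One mapping call, then plain memory access; the kernel
# faults pages in behind the scenes, at most once per page touched.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        via_mmap = m[:]
pages_touched = len(data) // PAGE

os.unlink(path)
print(f"read() calls: {read_calls}, pages mapped: {pages_touched}")
```

Both paths return identical bytes; the difference is how many times control has to cross into the kernel to get them.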

_wmd · on June 13, 2016

It costs a fortune to map a file on every OS I've tried. It makes little sense unless you're consuming a lot of data from few files rather than little data from many files.

Qantourisc · on June 14, 2016

The main reasons to keep it are: 1) network 2) what if we uncover a faster CPU/memory? We would need to reintroduce file systems. I'd say it's safe to deprecate it about 30 years after it becomes obsolete. (So we are sure nothing will surpass it any time soon.)

wmf · on June 13, 2016

http://pmem.io/

imtringued · on June 14, 2016

For now. If you build it they will come. Things like these take time but they won't happen if the hardware guys don't move first.

coldtea · on June 13, 2016

>Consumer OS I/O software stacks are completely unable to deal with this. It was fine for spinning drive latency, it's okay(ish) for high-performance SSD but the open/read/write model just cannot work that fast. It's several orders of magnitude off.

Citation needed.

jacquesm · on June 13, 2016

It's a game changer either way, and it will be one more nail in the coffin of anything mechanical.

zer00eyz · on June 13, 2016

For consumer and "online" hardware I fully agree, but we're going to have tape for a long time to come.

As a medium for offline, offsite storage it still reigns as king. Density, reliability, and durability remain high vs. platter-based drives or SSDs.

I know of at least two companies that keep tape going for compliance reasons. At the end of each year, they do a pretty large archival snapshot, and send it off to storage. It will remain there till some lawyer asks for it, and will most likely end up copied by a third party for the sake of integrity.

For one of them, the archive tapes are becoming a concern, the media is fine, but finding hardware is going to become a problem.

a_imho · on June 14, 2016

There are some interesting regulations in finance. I know at least one bank that keeps offline backups on tapes, but I wonder how often they test them.

My experience with them tells me they are not the most tech-savvy bunch. If they ever need to use backups with the software they were provided decades ago / by a modern fast-failing agile company (according to taste), even if they can get the source it will be very hard to compile and run it. Reproducible builds is not exactly a solved problem.

ashitlerferad · on June 15, 2016

> Reproducible builds is not exactly a solved problem.

Indeed, but we are getting there (over 90% of packages from Debian stretch amd64 and 99.9% of OpenWRT packages):

https://tests.reproducible-builds.org/debian/ https://tests.reproducible-builds.org/openwrt/

y4mi · on June 14, 2016

they need to put their entire environment into docker container and archive the images! theyll be able to recreate their whole environment at any time in the future!

...ill see myself out

Joeri · on June 13, 2016

SSDs haven't displaced drives for bulk data center storage because it's all about $/GB, and although SSDs are getting cheaper, spinning disks got cheaper in the same way. Intel is not going to compete on price, but on performance. So rest assured that spinning disks will be with us for a while.

jacquesm · on June 13, 2016

Bulk data is the key there. For many other use cases servers are already using SSDs, and once economies of scale cause another order-of-magnitude decrease in cost I think spinning disks will be much more expensive in $/GB than SSDs. Chips are cheap, robust, quite dense and should have a much longer operating life than anything mechanical. The writing is on the platters.

vitus · on June 14, 2016

Not just this; HDD technology has been stagnating, while SSD technology has yet to reach its full potential. In the last five years, HDDs haven't really gotten cheaper (other than the market returning to pre-flooding equilibrium). On the other hand, SSDs have gotten easily 4x cheaper in that same period of time.

Plus, since SSDs aren't using the latest process technology (as far as I'm aware), they still have a few more years of Moore's law ahead of them. Shrinking transistor size means increased density (and at lower cost, once you recoup the cost of the mask).

wtallis · on June 14, 2016

> Plus, since SSDs aren't using the latest process technology (as far as I'm aware), they still have a few more years of Moore's law ahead of them. Shrinking transistor size means increased density (and at lower cost, once you recoup the cost of the mask).

Nope. SSDs have already hit the wall and dug into it a ways. Everybody is doing 16nm/15nm NAND flash and struggling to get ~40nm 3D NAND out the door, but they can't beat the $/GB of planar NAND. Sub-20nm TLC flash has had serious problems with the data just plain leaking out of the memory cell, to say nothing of the scary low program/erase cycle counts. The only thing that really kept the SSD market advancing over the past year was the widespread adoption of better error correcting codes to cope with lower quality flash.

baq · on June 14, 2016

shrinking transistor size may also mean reduced durability, so it's not all roses. we'll see.

snuxoll · on June 14, 2016

If we halve the size of NAND memory cells we can afford to allocate an additional 20% of them for reallocation when other cells fail. It's not a perfect solution by any means, but it is a solution to increase the lifespan nonetheless.

marcusarmstrong · on June 13, 2016

I could see a world in which that changes due to a significant jump in reliability. If these chips are truly 1000x winners in write cycles, that changes the math significantly in terms of $/GB/hr, which is truly the important number for large scale data center storage.

Unklejoe · on June 14, 2016

I wonder what the actual latency improvements will be. If it's connected through PCI express, there will be a lower limit which is dictated by the bus. While reducing the latency of the drive will still certainly matter, this latency introduced by the bus will serve to lessen the relative improvements with respect to SSDs.

PCI Express is packetized and it takes a little over 1 nanosecond per byte for the 3.0 version. I believe the minimum packet size is 20 bytes, but I'm not positive on that. It seems like the latency could never be less than 2 * (1 * 20) nanoseconds in the case of a random access, and this is before the latency of the drive itself is factored in. Surely there will be a few clock cycles required for the drive HW to decode the PCIe transaction and act upon it.

That being said, any latency improvement of the drive _will_ have a direct reduction in effective latency in the end, so it’s all good news to me. I’m just curious about the 1000x figure that’s being referenced everywhere.
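The per-byte figure above can be derived from the PCIe 3.0 line rate: 8 GT/s per lane with 128b/130b encoding. The sketch below reproduces the comment's lower bound for a single lane. The 20-byte minimum packet size is the commenter's own unverified guess, carried over as a parameter, and the calculation ignores framing, switch hops and controller processing time. Function names are illustrative.

```python
TRANSFERS_PER_S = 8e9        # PCIe 3.0 raw rate per lane: 8 GT/s
ENCODING = 128 / 130         # 128b/130b line encoding efficiency


def ns_per_byte(lanes: int = 1) -> float:
    """Wire time for one payload byte, in nanoseconds."""
    bytes_per_s = TRANSFERS_PER_S * ENCODING / 8 * lanes
    return 1e9 / bytes_per_s


def min_round_trip_ns(packet_bytes: int = 20, lanes: int = 1) -> float:
    """Request plus response, each one minimum-size packet (assumed size)."""
    return 2 * packet_bytes * ns_per_byte(lanes)


print(f"x1: {ns_per_byte(1):.3f} ns/byte, floor {min_round_trip_ns(20, 1):.1f} ns")
print(f"x4: {ns_per_byte(4):.3f} ns/byte, floor {min_round_trip_ns(20, 4):.1f} ns")
```

On one lane this gives a floor of about 40ns, matching the estimate in the comment; a wider link shortens the wire time proportionally under the same assumptions.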

todd8 · on June 14, 2016

To put these latency numbers in perspective 1ns is about 11 inches at the speed of light.

umanwizard · on June 14, 2016

Or three cycles on a typical desktop CPU.

todd8 · on June 15, 2016

Yes, but even one round trip to the store could blow most of your latency budget. The store will have to be right next to the processor.
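The distance figures in this sub-thread are easy to reproduce. A minimal sketch using the vacuum speed of light (signals in copper or silicon travel slower, so real budgets are tighter); the 3 GHz clock is an assumption chosen to match the "three cycles" remark, and `light_distance_inches` is an illustrative name.

```python
C_M_PER_S = 299_792_458      # speed of light in vacuum (exact by definition)
METERS_PER_INCH = 0.0254     # exact by definition


def light_distance_inches(time_ns: float) -> float:
    """Distance light covers in vacuum during `time_ns` nanoseconds."""
    return C_M_PER_S * time_ns * 1e-9 / METERS_PER_INCH


per_ns = light_distance_inches(1.0)
per_cycle_3ghz = light_distance_inches(1.0 / 3.0)   # one cycle at 3 GHz
round_trip_reach = per_ns / 2                       # out and back within 1 ns

print(f"1 ns:          {per_ns:.1f} in")
print(f"1 cycle @3GHz: {per_cycle_3ghz:.1f} in")
print(f"1 ns round trip reaches at most {round_trip_reach:.1f} in away")
```

A round trip within a 1ns budget limits the one-way distance to under six inches even in vacuum, which is why the store would have to sit right next to the processor.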