CHAPTER 15
Memory Mapping and DMA

This chapter delves into the area of Linux memory management, with an emphasis on techniques that are useful to the device driver writer. Many types of driver programming require some understanding of how the virtual memory subsystem works; the material we cover in this chapter comes in handy more than once as we get into some of the more complex and performance-critical subsystems.
the data structures used by the kernel to manage memory. Once the necessary background has been covered, we can get into working with these structures.

Address Types

Linux is, of course, a virtual memory system, meaning that the addresses seen by user programs do not directly correspond to the physical addresses used by the hardware. Virtual memory introduces a layer of indirection that allows a number of nice things.
physical addresses differ only by a constant offset. Logical addresses use the hardware's native pointer size and, therefore, may be unable to address all of physical memory on heavily equipped 32-bit systems. Logical addresses are usually stored in variables of type unsigned long or void *. Memory returned from kmalloc has a kernel logical address.
Different kernel functions require different types of addresses. It would be nice if there were different C types defined, so that the required address types were explicit, but we have no such luck. In this chapter, we try to be clear on which types of addresses are used where.

Physical Addresses and Pages

Physical memory is divided into discrete units called pages. Much of the system's internal handling of memory is done on a per-page basis.
needed for the kernel code itself. As a result, x86-based Linux systems could work with a maximum of a little under 1 GB of physical memory. In response to commercial pressure to support more memory while not breaking 32-bit application and system compatibility, the processor manufacturers have added "address extension" features to their products.
there is one struct page for each physical page on the system. Some of the fields of this structure include the following:

atomic_t count;
    The number of references there are to this page. When the count drops to 0, the page is returned to the free list.

void *virtual;
    The kernel virtual address of the page, if it is mapped; NULL, otherwise. Low-memory pages are always mapped; high-memory pages usually are not.
defined in <linux/mm.h>. In most situations, you want to use a version of kmap rather than page_address.

#include <linux/highmem.h>
void *kmap(struct page *page);
void kunmap(struct page *page);

kmap returns a kernel virtual address for any page in the system. For low-memory pages, it just returns the logical address of the page; for high-memory pages, kmap creates a special mapping in a dedicated part of the kernel address space.
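As a quick illustration — a minimal sketch, not from the original text — a driver might use kmap and kunmap to copy data out of a page that could live in high memory:

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Copy data out of a page that may live in high memory. kmap may
     * sleep, so this must run in process context; the mapping is
     * released promptly because the number of such mappings is limited. */
    static void copy_from_any_page(struct page *page, void *dest, size_t len)
    {
        void *vaddr = kmap(page);
        memcpy(dest, vaddr, len);
        kunmap(page);
    }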
Virtual Memory Areas

The virtual memory area (VMA) is the kernel data structure used to manage distinct regions of a process's address space. A VMA represents a homogeneous region in the virtual memory of a process: a contiguous range of virtual addresses that have the same permission flags and are backed up by the same object (a file, say, or swap space).
Each field in /proc/*/maps (except the image name) corresponds to a field in struct vm_area_struct:

start
end
    The beginning and ending virtual addresses for this memory area.

perm
    A bit mask with the memory area's read, write, and execute permissions. This field describes what the process is allowed to do with pages belonging to the area. The last character in the field is either p for "private" or s for "shared."
VMAs are as follows (note the similarity between these fields and the /proc output we just saw):

unsigned long vm_start;
unsigned long vm_end;
    The virtual address range covered by this VMA. These fields are the first two fields shown in /proc/*/maps.

struct file *vm_file;
    A pointer to the struct file structure associated with this area (if any).

unsigned long vm_pgoff;
    The offset of the area in the file, in pages.
struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);
    When a process tries to access a page that belongs to a valid VMA, but that is currently not in memory, the nopage method is called (if it is defined) for the related area. The method returns the struct page pointer for the physical page after, perhaps, having read it in from secondary storage.
The full list of the X server's VMAs is lengthy, but most of the entries are not of interest here. We do see, however, four separate mappings of /dev/mem, which give some insight into how the X server works with the video card. The first mapping is at a0000, which is the standard location for video RAM in the 640-KB ISA hole. Further down, we see a large mapping at e8000000, an address which is above the highest RAM address on the system.
video memory; mapping the graphic display to user space dramatically improves the throughput, as opposed to an lseek/write implementation. Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
The value returned by the function is the usual 0 or a negative error code. Let's look at the exact meaning of the function's arguments:

vma
    The virtual memory area into which the page range is being mapped.

virt_addr
    The user virtual address where remapping should begin. The function builds page tables for the virtual address range between virt_addr and virt_addr+size.
derived from drivers/char/mem.c.
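The listing itself did not survive here, so what follows is a reconstructed sketch of an mmap method built on remap_pfn_range, consistent with the simple_remap_mmap example discussed next (simple_remap_vm_ops and simple_vma_open are the module's own helpers):

    static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        /* Map the physical range starting at vma->vm_pgoff into the
         * process's address space with the protections it asked for. */
        if (remap_pfn_range(vma, vma->vm_start,
                            vma->vm_pgoff,
                            vma->vm_end - vma->vm_start,
                            vma->vm_page_prot))
            return -EAGAIN;

        vma->vm_ops = &simple_remap_vm_ops;
        simple_vma_open(vma);
        return 0;
    }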
To make these operations active for a specific mapping, it is necessary to store a pointer to simple_remap_vm_ops in the vm_ops field of the relevant VMA. This is usually done in the mmap method. If you turn back to the simple_remap_mmap example, you see these lines of code:

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);

Note the explicit call to simple_vma_open.
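For reference, the operations structure these lines point at can be as simple as a pair of logging methods; a representative sketch:

    static void simple_vma_open(struct vm_area_struct *vma)
    {
        printk(KERN_NOTICE "Simple VMA open, virt %lx, phys %lx\n",
               vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
    }

    static void simple_vma_close(struct vm_area_struct *vma)
    {
        printk(KERN_NOTICE "Simple VMA close.\n");
    }

    static struct vm_operations_struct simple_remap_vm_ops = {
        .open  = simple_vma_open,
        .close = simple_vma_close,
    };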
The nopage method should also store the type of fault in the location pointed to by the type argument—but only if that argument is not NULL. In device drivers, the proper value for type will invariably be VM_FAULT_MINOR.
Otherwise, pfn_to_page gets the necessary struct page pointer; we can increment its reference count (with a call to get_page) and return it. The nopage method normally returns a pointer to a struct page. If, for some reason, a normal page cannot be returned (e.g., the requested address is beyond the device's memory region), NOPAGE_SIGBUS can be returned to signal the error; that is what the simple code above does.
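The code being described was lost to the page break above; a sketch consistent with the description might look like this:

    static struct page *simple_vma_nopage(struct vm_area_struct *vma,
                                          unsigned long address, int *type)
    {
        struct page *pageptr;
        unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
        unsigned long physaddr = address - vma->vm_start + offset;
        unsigned long pageframe = physaddr >> PAGE_SHIFT;

        if (!pfn_valid(pageframe))   /* beyond real memory? */
            return NOPAGE_SIGBUS;
        pageptr = pfn_to_page(pageframe);
        get_page(pageptr);           /* bump the reference count */
        if (type)
            *type = VM_FAULT_MINOR;
        return pageptr;
    }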
Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device area. If your driver fails to define a nopage method, it is never notified of this extension, and the additional area maps to the zero page.
map the physical page located at address 64 KB—instead, we see a page full of zeros (the host computer in this example is a PC, but the result would be the same on other platforms):

morgana.root# .
simply does not know how to properly manage reference counts for pages that are part of higher-order allocations. (Return to the section "A scull Using Whole Pages: scullp" in Chapter 8 if you need a refresher on scullp and the memory allocation order value.) The zero-order limitation is mostly intended to keep the code simple.
Most of the work is then performed by nopage.
crw-r--r--   1 root   root   10, 175 Sep
morgana% .
Performing Direct I/O

Most I/O operations are buffered through the kernel. The use of a kernel-space buffer allows a degree of separation between user space and the actual device; this separation can make programming easier and can also yield performance benefits in many situations. There are cases, however, where it can be beneficial to perform I/O directly to or from a user-space buffer.
This function has several arguments:

tsk
    A pointer to the task performing the I/O; its main purpose is to tell the kernel who should be charged for any page faults incurred while setting up the buffer. This argument is almost always passed as current.

mm
    A pointer to the memory management structure describing the address space to be mapped.
list from the array of struct page pointers. We discuss how to do this in the section, "Scatter/gather mappings." Once your direct I/O operation is complete, you must release the user pages. Before doing so, however, you must inform the kernel if you changed the contents of those pages.
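Putting the pieces together — a minimal sketch, with hypothetical my_* helper names, of pinning a user buffer with get_user_pages and releasing it afterward (the caller of get_user_pages must hold mmap_sem for reading):

    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/sched.h>

    /* Pin npages of the user buffer at uaddr; returns the number of
     * pages actually mapped, or a negative error code. */
    static int my_map_user_buffer(unsigned long uaddr, int npages,
                                  int write, struct page **pages)
    {
        int res;

        down_read(&current->mm->mmap_sem);
        res = get_user_pages(current, current->mm, uaddr, npages,
                             write /* pages may be written to */,
                             0     /* do not force access */,
                             pages, NULL /* no VMA pointers needed */);
        up_read(&current->mm->mmap_sem);
        return res;
    }

    /* Release the pages, marking them dirty if the device wrote to them. */
    static void my_release_user_buffer(struct page **pages, int npages,
                                       int dirty)
    {
        int i;

        for (i = 0; i < npages; i++) {
            if (dirty && !PageReserved(pages[i]))
                SetPageDirty(pages[i]);
            page_cache_release(pages[i]);
        }
    }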
For the rare driver author who needs to implement asynchronous I/O, we present a quick overview of how it works. We cover asynchronous I/O in this chapter, because its implementation almost always involves direct I/O operations as well (if you are buffering data in the kernel, you can usually implement asynchronous behavior without imposing the added complexity on user space). Drivers supporting asynchronous I/O should include <linux/aio.h>.
needs to know about the operation, and return -EIOCBQUEUED to the caller. Remembering the operation information includes arranging access to the user-space buffer; once you return, you will not again have the opportunity to access that buffer while running in the context of the calling process. In general, that means you will likely have to set up a direct kernel mapping (with get_user_pages) or a DMA mapping.
    /* Copy now while we can access the buffer */
    if (write)
        result = scullp_write(iocb->ki_filp, buf, count, &pos);
    else
        result = scullp_read(iocb->ki_filp, buf, count, &pos);

    /* If this is a synchronous IOCB, we return our status now. */
    if (is_sync_kiocb(iocb))
        return result;

    /* Otherwise defer the completion for a few milliseconds. */
Overview of a DMA Data Transfer

Before introducing the programming details, let's review how a DMA transfer takes place, considering only input transfers to simplify the discussion. Data transfer can be triggered in two ways: either the software asks for data (via a function such as read) or the hardware asynchronously pushes data to the system. In the first case, the steps involved can be summarized as follows:
Another relevant item introduced here is the DMA buffer. DMA requires device drivers to allocate one or more special buffers suited to DMA. Note that many drivers allocate their buffers at initialization time and use them until shutdown—the word allocate in the previous lists, therefore, means "get hold of a previously allocated buffer."
when the requested buffer is far less than 128 KB, because system memory becomes fragmented over time. When the kernel cannot return the requested amount of memory or when you need more than 128 KB (a common requirement for PCI frame grabbers, for example), an alternative to returning -ENOMEM is to allocate memory at boot time or reserve the top of physical RAM for your buffer.
At the lowest level (again, we'll look at a higher-level solution shortly), the Linux kernel provides a portable solution by exporting the following functions, defined in <asm/io.h>. The use of these functions is strongly discouraged, because they work properly only on systems with a very simple I/O architecture; nonetheless, you may encounter them when working with kernel code.
The mask should show the bits that your device can address; if it is limited to 24 bits, for example, you would pass mask as 0x00ffffff. The return value is nonzero if DMA is possible with the given mask; if dma_set_mask returns 0, you are not able to use DMA operations with this device.
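In code, the check might look like this minimal sketch (card and use_dma are hypothetical driver-private names):

    if (dma_set_mask(dev, 0x00ffffff))
        card->use_dma = 1;          /* the device can do 24-bit DMA */
    else {
        card->use_dma = 0;          /* we'll have to live without DMA */
        printk(KERN_WARNING "mydev: DMA not supported\n");
    }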
coherency in the hardware, but others require software support. The generic DMA layer goes to great lengths to ensure that things work correctly on all architectures, but, as we will see, proper behavior requires adherence to a small set of rules. The DMA mapping sets up a new type, dma_addr_t, to represent bus addresses.
this function so that the buffer is placed in a location that works with DMA; usually the memory is just allocated with get_free_pages (but note that the size is in bytes, rather than an order value). The flag argument is the usual GFP_ value describing how the memory is to be allocated; it should usually be GFP_KERNEL or, when running in atomic context, GFP_ATOMIC.
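A minimal sketch of coherent-buffer allocation and release, assuming dev is the relevant struct device pointer:

    #include <linux/dma-mapping.h>

    void *vaddr;
    dma_addr_t bus_addr;

    /* Allocate an 8 KB buffer visible to both the CPU and the device. */
    vaddr = dma_alloc_coherent(dev, 8192, &bus_addr, GFP_KERNEL);
    if (!vaddr)
        return -ENOMEM;
    /* ... hand bus_addr to the device; access the buffer via vaddr ... */
    dma_free_coherent(dev, 8192, vaddr, bus_addr);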
returned. As with dma_alloc_coherent, the address of the resulting DMA buffer is returned as a kernel virtual address and stored in handle as a bus address. Unneeded buffers should be returned to the pool with:

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr);

Setting up streaming DMA mappings

Streaming mappings have a more complicated interface than the coherent variety, for a number of reasons.
Some important rules apply to streaming DMA mappings:

• The buffer must be used only for a transfer that matches the direction value given when it was mapped.

• Once a buffer has been mapped, it belongs to the device, not the processor. Until the buffer has been unmapped, the driver should not touch its contents in any way.
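With those rules in mind, here is a minimal sketch of a single streaming mapping around one transfer (dev, buf, and len are assumed to come from the driver):

    #include <linux/dma-mapping.h>

    dma_addr_t bus_addr;

    /* Map the buffer for a device-bound transfer; the CPU must not
     * touch it until it is unmapped again. */
    bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    /* ... program the device with bus_addr and start the transfer ... */
    /* ... wait for completion (typically an interrupt) ... */
    dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);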
The processor, once again, should not access the DMA buffer after this call has been made.

Single-page streaming mappings

Occasionally, you may want to set up a mapping on a buffer for which you have a struct page pointer; this can happen, for example, with user-space buffers mapped with get_user_pages.
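A sketch of the page-oriented variant, assuming page came from get_user_pages and the whole page takes part in the transfer:

    dma_addr_t bus_addr;

    bus_addr = dma_map_page(dev, page, 0 /* offset within the page */,
                            PAGE_SIZE, DMA_FROM_DEVICE);
    /* ... let the device fill the page ... */
    dma_unmap_page(dev, bus_addr, PAGE_SIZE, DMA_FROM_DEVICE);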
dependent, and is described in <asm/scatterlist.h>. However, it always contains three fields:

struct page *page;
    The struct page pointer corresponding to the buffer to be used in the scatter/gather operation.
Scatter/gather mappings are streaming DMA mappings, and the same access rules apply to them as to the single variety.
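As an illustration — a minimal sketch with a hypothetical program_device_segment helper, assuming dev, sglist, and nents come from the driver — mapping a scatterlist that has already been filled in:

    #include <linux/dma-mapping.h>
    #include <asm/scatterlist.h>

    int i, count;

    count = dma_map_sg(dev, sglist, nents, DMA_TO_DEVICE);
    for (i = 0; i < count; i++) {
        /* Hand each segment's bus address and length to the device.
         * The mapping code may have coalesced entries, so loop over
         * count, not nents. */
        program_device_segment(sg_dma_address(&sglist[i]),
                               sg_dma_len(&sglist[i]));
    }
    /* ... after the transfer completes, unmap with the original nents ... */
    dma_unmap_sg(dev, sglist, nents, DMA_TO_DEVICE);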
void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev, dma64_addr_t dma_addr, size_t len, int direction);

A simple PCI DMA example

As an example of how the DMA mappings might be used, we present a simple example of DMA coding for a PCI device. The actual form of DMA operations on the PCI bus is very dependent on the device being driven.
Obviously, a great deal of detail has been left out of this example, including whatever steps may be required to prevent attempts to start multiple, simultaneous DMA operations.

DMA for ISA Devices

The ISA bus allows for two kinds of DMA transfers: native DMA and ISA bus master DMA. Native DMA uses standard DMA-controller circuitry on the motherboard to drive the signal lines on the ISA bus.
The channels are numbered from 0 to 7: channel 4 is not available to ISA peripherals, because it is used internally to cascade the slave controller onto the master. The available channels are, thus, 0–3 on the slave (the 8-bit channels) and 5–7 on the master (the 16-bit channels). The size of any DMA transfer, as stored in the controller, is a 16-bit number representing the number of bus cycles.
    /* ... */
    if ( (error = request_irq(my_device.irq, dad_interrupt,
                              SA_INTERRUPT, "dad", NULL)) )
        return error; /* or implement blocking open */

    if ( (error = request_dma(my_device.dma, "dad")) ) {
        free_irq(my_device.irq, NULL);
        return error; /* or implement blocking open */
    }
    /* ... */
particular, we don't deal with the issue of 8-bit versus 16-bit data transfers. If you are writing device drivers for ISA device boards, you should find the relevant information in the hardware manuals for the devices. The DMA controller is a shared resource, and confusion could arise if more than one processor attempts to program it simultaneously. For that reason, the controller is protected by a spinlock, called dma_spin_lock.
In addition to these functions, there are a number of housekeeping facilities that must be used when dealing with DMA devices:

void disable_dma(unsigned int channel);
    A DMA channel can be disabled within the controller. The channel should be disabled before the controller is configured to prevent improper operation.
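Putting these housekeeping calls together, here is a sketch of preparing one ISA DMA transfer, patterned after the hypothetical "dad" driver used throughout this section:

    #include <asm/dma.h>
    #include <asm/io.h>    /* for virt_to_bus */

    int dad_dma_prepare(int channel, int mode, void *buf, unsigned int count)
    {
        unsigned long flags;

        flags = claim_dma_lock();
        disable_dma(channel);          /* quiesce the channel first */
        clear_dma_ff(channel);         /* reset the address flip-flop */
        set_dma_mode(channel, mode);   /* DMA_MODE_READ or DMA_MODE_WRITE */
        set_dma_addr(channel, virt_to_bus(buf));
        set_dma_count(channel, count);
        enable_dma(channel);
        release_dma_lock(flags);
        return 0;
    }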
    int residue;
    unsigned long flags = claim_dma_lock();

    residue = get_dma_residue(channel);
    release_dma_lock(flags);
    return (residue == 0);
}

The only thing that remains to be done is to configure the device board. This device-specific task usually consists of reading or writing a few I/O ports. Devices differ in significant ways.
void *kmap(struct page *page);
void kunmap(struct page *page);
    kmap returns a kernel virtual address that is mapped to the given page, creating the mapping if need be. kunmap deletes the mapping for the given page.
int is_sync_kiocb(struct kiocb *iocb);
    Macro that returns nonzero if the given IOCB requires synchronous execution.

int aio_complete(struct kiocb *iocb, long res, long res2);
    Function that indicates completion of an asynchronous I/O operation.

Direct Memory Access
dma_addr_t dma_map_single(struct device *dev, void *buffer, size_t size, enum dma_data_direction direction);
void dma_unmap_single(struct device *dev, dma_addr_t bus_addr, size_t size, enum dma_data_direction direction);
    Create and destroy a single-use, streaming DMA mapping.
int request_dma(unsigned int channel, const char *name);
void free_dma(unsigned int channel);
    Access the DMA registry. Registration must be performed before using ISA DMA channels.

unsigned long claim_dma_lock();
void release_dma_lock(unsigned long flags);
    Acquire and release the DMA spinlock, which must be held prior to calling the other ISA DMA functions described later in this list.