HashMap详解

本文是以JDK7中的HashMap进行介绍, 先整体介绍, 再单独介绍一些具体的点

存储结构
构造函数
需要知道的概念
关于hash
计算桶坐标:bucketIndex
添加元素put
关于modCount
关于扩容
关于并发

存储结构

HashMap的数据结构是数组加链表的形式。结构大体如下:

HashMap结构图

HashMap类中有个table属性, 是一个数组, 数组元素类型为Entry(内部类), 而Entry有key, value, next, hash四个属性。
一般table数组的每个位置我们称为bucket, 下标称为bucketIndex, 即每一个位置表示一个桶, 同一个桶中的Entry元素hash值相同(null除外, 后面说)

构造函数

HashMap(int initialCapacity, float loadFactor): 指定初始容量和负载因子
HashMap(int initialCapacity): 指定初始容量, 使用默认负载因子
HashMap(): 使用默认初始容量和负载因子
HashMap(Map<? extends K, ? extends V> m): 通过另一个map初始化, 容量取m.size() / DEFAULT_LOAD_FACTOR) + 1与默认初始容量两者较大的那个,使用默认负载因子

需要知道的概念

HashMap中有几个概念, 还有几个成员属性, 需要了解它们的意义:

容量: table数组的大小(table.length), 并没有一个属性来描述, 内部用capacity()函数来获取数组大小. 默认初始容量DEFAULT_INITIAL_CAPACITY为16, 每次扩容都扩大2倍.
size: 该属性表示key-value对的元素个数, 即每新增一个Entry会加1, 每减少一个 Entry会减1
loadFactor: 负载因子, 这相当于一个系数, 用于计算threshold. 默认负载因子DEFAULT_LOAD_FACTOR为0.75
threshold: 阈值, 当size达到该阈值, 就需要扩容. 该阈值由capacity * loadFactor计算得到(构造函数里,直接threshold = initialCapacity, 但构造函数中没有给table分配空间, 分配空间时, 阈值由Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1)决定).
modCount: 该属性可理解为对该map所做的修改次数, 比如增加或者减少一个Entry都会加1(size会变少, modCount只会变大)

关于hash

既然是HashMap, 一定避免不了hash. Object都有hashCode()函数, 但在HashMap又提供了一个辅助哈希函数hash(Object k), 这个函数在JDK7与JDK8中不同. 当key为null时, 是没有hashCode()的, 因此要单独处理. JDK7中有专门的putForNullKey()函数来把null放在下标为0的bucket中; 而JDK8直接在辅助哈希函数hash()中把null的哈希值设置为0.

    // JDK7
    final int hash(Object k) {
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }

        h ^= k.hashCode();

        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
   // JDK8
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

计算桶坐标:bucketIndex

这一步是用来计算数组下标(即应放到那个bucket中)的。但是这一步很关键，也正好解释了很多问题，如:

为什么初始容量DEFAULT_INITIAL_CAPACITY为16?
为什么扩容的时候是2倍扩容?
为什么容量总是2的次幂?(构造函数里传的初始容量为40, 最后得到的容量为64, 必须是2的次幂)

计算数组下标的函数如下:

    int hash = hash(key);  // 辅助哈希函数
    int i = indexFor(hash, table.length); // 求桶坐标的函数
    // 具体实现
    static int indexFor(int h, int length) {
        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
        return h & (length-1);
    }

正常来讲, 哈希值的索引一般通过index = h % length计算, 但是若length总是2的整数次幂, 就可以通过位运算h & (length-1)来优化.

添加元素put

刚创建的HashMap中的table是空的(EMPTY_TABLE), 真正添加元素的时候才回分配空间
添加元素时, 如果key一样, 会替换掉原来Entry的value
添加元素时, 有可能涉及到扩容
新添加的元素位于链表头, 即table[bucketIndex]的位置

    // JDK7 put方法
    public V put(K key, V value) {
        // 如果数组是空的, 需要根据初始容量来扩张. 默认构造函数中, threshold=initialCapacity
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);  // table大小为: a power of 2 >= threshold
        }
        if (key == null)
            return putForNullKey(value);  // null 放到 table[0] 的位置
        int hash = hash(key);
        int i = indexFor(hash, table.length);  // 通过位运算得到桶的位置
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {  // 这个循环里, 处理该key已经存在的情况
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;  // 返回被替换掉的值
            }
        }
        modCount++;  // 修改次数增加
        addEntry(hash, key, value, i); // 真正的添加元素
        return null;
    }
    // 添加元素的函数
    void addEntry(int hash, K key, V value, int bucketIndex) {
        // 这就是需要扩容的条件(扩容的条件有2个)
        if ((size >= threshold) && (null != table[bucketIndex])) {
            resize(2 * table.length);  // 扩容是在这里进行的
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);  // 扩容后重新计算桶的位置
        }
        // 新创建一个元素并添加到 bucketIndex 位置
        createEntry(hash, key, value, bucketIndex);
    }
    void createEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<>(hash, key, value, e); // 放到链表头
        size++;  // 元素数量+1
    }

关于modCount

平时使用HashMap遇到过的一个异常是ConcurrentModificationException. 这种情况是因为在使用迭代器的过程中, 增加或者删除了元素, 导致modCount变化导致的. 原因就在迭代器中:

// 迭代器构造方法
HashIterator() {
    expectedModCount = modCount;  // 创建迭代器的时候expectedModCount=modCount, 不相等的时候快速失败
    if (size > 0) { // advance to first entry
        Entry[] t = table;
        while (index < t.length && (next = t[index++]) == null) // 这里第一个有元素的桶(index为桶坐标)
            ;
    }
}
// 迭代器找下一个元素方法
final Entry<K,V> nextEntry() {
    if (modCount != expectedModCount) // 在这里抛出异常, 快速失败
        throw new ConcurrentModificationException();
    Entry<K,V> e = next;
    if (e == null)
        throw new NoSuchElementException();

    if ((next = e.next) == null) {  // 如果 e.next是null, 就到了链表尾, 此时应该找下一个桶中的元素
        Entry[] t = table;
        while (index < t.length && (next = t[index++]) == null)  // 下一个有元素的桶
            ;
    }
    current = e;
    return e;
}

关于扩容

HashMap扩容是通过resize(int newCapacity)函数实现的, 而扩容的时机, 则是在put函数里真正添加元素时调用addEntry来控制的.

扩容需要两个条件:

size >= threshold: 元素个数达到阈值
null != table[bucketIndex]: 新元素要放的桶中有元素(该桶中没有元素不会扩容)

扩容过程如下:

// 在 addEntry 中调用resize扩容
resize(2 * table.length); // 每次都是2倍扩容
// resize函数
void resize(int newCapacity) { // capacity MUST be a power of two
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {  // 最大容量时就不再扩容了
        threshold = Integer.MAX_VALUE;
        return;
    }
    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable, initHashSeedAsNeeded(newCapacity)); // 这里把原来的元素移动到新table中
    table = newTable;
    threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1); // 更新阈值
}
// 移动元素的 transfer 函数
void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {
        while(null != e) {
            Entry<K,V> next = e.next; // 头插法
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity); // 新的桶位置(通过位运算来计算)
            e.next = newTable[i];
            newTable[i] = e;
            e = next;
        }
    }
}

关于并发

HashMap不是线程安全的, 一般还会提到一个不常用的但是线程安全的对象Hashtable, 它相当于HashMap的方法(包括toString)加锁(synchronized)来控制并发.

若要考虑并发问题, 一般推荐使用ConcurrentHashMap, 它通过Segment来实现分段加锁。

Gitalking ...