Repeated DNA Sequences · Dongyan Li's Notebook

https://leetcode.com/problems/repeated-dna-sequences/

brute force

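Use two sets to track the 10-letter substrings. The first set records every substring seen so far. If adding a substring to it fails, the substring is a duplicate, so add it to the second set. After the loop, copy the second set into the result list with addAll(set2).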

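import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> set = new HashSet<>();   // every substring seen so far
        Set<String> set2 = new HashSet<>();  // substrings seen more than once
        List<String> lst = new ArrayList<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String sub = s.substring(i, i + 10);
            // add() returns false when the substring is already in the set
            if (!set.add(sub)) {
                set2.add(sub);
            }
        }
        lst.addAll(set2);
        return lst;
    }
}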

bit manipulation

The string contains only four distinct characters: A, C, G and T. Two bits are therefore enough to encode each character:

A: 00
C: 01
G: 10
T: 11

So an entire 10-letter substring needs only 2 * 10 = 20 bits, which fits in a single 4-byte int. The naive approach stores each substring as a String, which at 2 bytes per Java char takes up to 20 bytes of character data, not counting object overhead.

With bit shifting, we can build this encoding ourselves: for each character, shift the accumulated value left by 2 bits and OR in the character's 2-bit code.
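To make the packing concrete, here is a minimal sketch of the encoding on its own. The class name DnaEncoding and the helpers code and encode are hypothetical names chosen for illustration; they are not part of the solution below, which uses a lookup array instead of a switch.

```java
final class DnaEncoding {
    // Hypothetical helper: 2-bit code per nucleotide, matching the table above.
    static int code(char c) {
        switch (c) {
            case 'A': return 0; // 00
            case 'C': return 1; // 01
            case 'G': return 2; // 10
            case 'T': return 3; // 11
            default:
                throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs a sequence of up to 16 nucleotides into one int, 2 bits each.
    static int encode(String seq) {
        int packed = 0;
        for (int i = 0; i < seq.length(); i++) {
            packed = (packed << 2) | code(seq.charAt(i));
        }
        return packed;
    }

    public static void main(String[] args) {
        // "ACGT" -> 00 01 10 11 -> binary 11011 (leading zeros dropped)
        System.out.println(Integer.toBinaryString(encode("ACGT")));
        // Ten T's set all 20 bits: 2^20 - 1
        System.out.println(encode("TTTTTTTTTT"));
    }
}
```

Because every 10-letter substring maps to a distinct 20-bit value, comparing the ints is equivalent to comparing the substrings.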

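import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<Integer> set1 = new HashSet<>();  // every encoded substring seen so far
        Set<Integer> set2 = new HashSet<>();  // encoded substrings already reported
        List<String> lst = new ArrayList<>();
        // 2-bit code per letter; 'A' keeps the default value 0
        int[] map = new int[26];
        map['C' - 'A'] = 1;
        map['G' - 'A'] = 2;
        map['T' - 'A'] = 3;
        for (int i = 0; i + 10 <= s.length(); i++) {
            // Pack the 10 letters starting at i into the low 20 bits of an int
            int substring = 0;
            for (int j = i; j < i + 10; j++) {
                substring = substring << 2;
                char c = s.charAt(j);
                substring |= map[c - 'A'];
            }
            // Seen before (set1.add fails) and not yet reported (set2.add succeeds)
            if (!set1.add(substring)) {
                if (set2.add(substring)) {
                    lst.add(s.substring(i, i + 10));
                }
            }
        }
        return lst;
    }
}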
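
The solution above re-encodes all 10 letters for every starting index. One possible refinement, sketched here as a variation and not as the approach used in this post, is to keep a rolling 20-bit window: shift in the next letter and mask off the bits of the letter that fell out. The names RollingDna, findRepeated and code are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RollingDna {
    // Hypothetical helper: 2-bit code per nucleotide (A=0, C=1, G=2, T=3).
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    static List<String> findRepeated(String s) {
        final int len = 10;
        final int mask = (1 << (2 * len)) - 1; // keeps only the low 20 bits
        Set<Integer> seen = new HashSet<>();
        Set<Integer> reported = new HashSet<>();
        List<String> result = new ArrayList<>();
        int window = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the new letter; the mask drops the letter leaving the window.
            window = ((window << 2) | code(s.charAt(i))) & mask;
            // Only start checking once the window holds a full 10 letters.
            if (i >= len - 1 && !seen.add(window) && reported.add(window)) {
                result.add(s.substring(i - len + 1, i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Each letter is now encoded once, so the work per position is constant instead of proportional to the window length.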