Python Binary Search for Dictionary -


i have huge .txt.gz file ~800kb of words , score values formatted "value word" each line , sorted word. fastest way lookup value of given word?

i can read file using:

import gzip  f = gzip.open('somefile.txt.gz', 'rb') file_content = f.read() 

would quickest method binary search of sort?

sample data:

0.090909 chevre#n#1

0.058824 chevron#n#1

0.071429 chevron#n#2

0.071429 chevrotain#n#1

0.166667 chewa#n#1

a binary search effective way this, you're still going have move data text (just bunch of bytes, after all) other data structure list. if have program short lifetime, or no long-term memory restrictions, (probably) faster load whole thing python dict @ startup (or whenever appropriate):

# may not work right file format, idea. lookup = {} line in f:     if line:         value, key = line.trim().split():         lookup[key] = value 

then access using python's builtin dictionaries, nice , fast:

def get_value(word):     return lookup.get(word) 

edit

if option read in whole file each word, , you're searching many words, time save implementing clever search algorithm going marginal compared time spend opening , reading files on , over. want database, handle sort of thing quickly. said, given these parameters, i'd if had use filesystem:

import bisect  # define searchable (word, value) tuples every word in file. # i'm assuming files sorted, if not, sort list (slow!!) words = [(w[1], w[0]) w in (line.strip().split() line in f if line)]  # binary search word , return associated value. def get_value(word):     idx = bisect.bisect_left(words, (word,none)) # tuples compare element-wise     if idx != len(words) , words[idx][0] == word:         return words[idx][1]     raise valueerror('word not found') 

finally, notice you're using gzipped files, sensible if storage space issue, it's going slow process down more. once again have suggest database. in case, don't know if you're having trouble here, in case, reading gzipped files not "harder" reading normal files. take @ gzip module. basically, gzip files work regular files, can still write for line in file , on.


Comments

Popular posts from this blog

jquery - How can I dynamically add a browser tab? -

node.js - Getting the socket id,user id pair of a logged in user(s) -

keyboard - C++ GetAsyncKeyState alternative -