ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax


You can try:

self.cursor.execute("UPDATE urls SET state=%d,content='%s' WHERE url='%s'"%(state,self.conn.escape_string(content),url))

Comments

  • Moon Taejin
    Moon Taejin over 3 years

    I'm making a simple crawling program with Python and MySQL. When I run it, an error occurs and the contents crawled from the web are not updated in the MySQL table. The error message shows a ProgrammingError and a syntax error, but I don't think I typed the code wrong, because the reported problem point contains an HTML tag. Why does an HTML tag appear in the error message? I think there is some problem between MySQL and Python. Here is the error message.

    Traceback (most recent call last):
      File "crawl.py", line 237, in <module>
        parseArticle( u )
      File "crawl.py", line 166, in parseArticle
        db.updateURL( url , contents )
      File "crawl.py", line 206, in updateURL
        self.cursor.execute("UPDATE urls SET state=%d,content='%s' WHERE url='%s'"%(state,content,url))
      File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute
        self.errorhandler(self, exc, value)
      File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
        raise errorclass, errorvalue
    ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \'\xeb\x8f\x8b\xec\x9b\x80\', dotum, sans-serif; }\r\n\t//--&gt;\r\n\t</style>\n<p style="TEXT-ALIGN: center\' at line 1')
    

    And here is the source code. Thank you for your help.

    # -*- coding: utf-8 -*-
    
    from bs4 import BeautifulSoup
    import robotparser
    import urllib2
    import time, traceback, re, sys, os
    import MySQLdb
    
    crawler_name = 'daum_blog_crawler'     
    mainpage = 'http://blog.daum.net/'      
    
    rp = robotparser.RobotFileParser( mainpage + 'robots.txt' )
    rp.read()
    
    def canFetch( url ):
            return rp.can_fetch( crawler_name, url )
    
    def getContent( url, delay=1):
            time.sleep( delay )
    
            if not canFetch( url ):
                    print 'This url can NOT be fetched by our crawler :', url
                    return None
            try:
                    opener = urllib2.build_opener()
                    opener.addheaders = [('User-agent',crawler_name)]
                    contents = opener.open(url).read()
            except:
                    traceback.print_exc() 
                    return None
            return contents
    
    def getArticleInfo( soup ):
    
            rBlog = re.compile('.+blog.daum.net/\w+/\d+.*?')
            URLs = soup('a',{'href':rBlog})
    
            return [ u.get('href').split('?')[0] for u in URLs ]
    
    def getOwnArticles( contents ):           
            ret = []
            soup = BeautifulSoup( contents )
            rBlog = re.compile('.+/BlogTypeView.+')
            for u in soup('a',{'href':rBlog}):
                    href = u.get('href')
                    article = href.split('articleno=')[1].split('&')[0]
                    if ret.count(article)<1:
                            ret.append( article )
            return ret
    
    def gatherNeighborInfo( soup ):
    
            rBlog = re.compile('http://blog.daum.net/\w+')
            Neighbors = soup('a',{'href':rBlog})
            cnt = 0
            for n in Neighbors:
                    url = n.get('href')
                    blogname = url.split('/')[-1]
                    if url and url.startswith('http://') and db.isCrawledURL(url)<1:
                            db.insertURL( url, 1 ) 
    
                            url2 = getRedirectedURL( url )
                            if not url2: continue
                            re_url = 'http://blog.daum.net' + url2
                            body = getContent( re_url, 0 )
                            if body:
                                    for u in getOwnArticles( body ):
    
                                            fullpath = 'http://blog.daum.net/'+blogname+'/'+u
                                            cnt+=db.insertURL( fullpath )
    
            if cnt>0: print '%d neighbor articles inserted'%cnt
    
    def getRedirectedURL( url ):
            contents = getContent( url )
            if not contents: return None
    
            #redirect
            try:
                    soup = BeautifulSoup( contents )
                    frame = soup('frame')           
                    src = frame[0].get('src')
            except:
                    src = None
            return src
    
    def getBody( soup, parent ):
    
            rSrc = re.compile('.+/ArticleContentsView.+')
            iframe = soup('iframe',{'src':rSrc})
            if len(iframe)>0:
                    src = iframe[0].get('src')
                    iframe_src = 'http://blog.daum.net'+src
    
    
                    req = urllib2.Request( iframe_src )
                    req.add_header('Referer', parent )
                    body = urllib2.urlopen(req).read()
                    soup = BeautifulSoup( body ) 
                    strbody= str(soup.body)
                    return strbody   
            else:
                    print 'NULL contents'
                    return ''
    
    def parseArticle( url ):
    
            article_id = url.split('/')[-1]
            blog_id = url.split('/')[-2]
    
            #for debugging, temp
            if blog_id.isdigit():
                    print 'digit:', url.split('/')
    
            newURL = getRedirectedURL( url )
    
            if newURL:
    
                    newURL = 'http://blog.daum.net'+newURL
                    print 'redirecting', newURL
                    contents = getContent( newURL, 0 )
                    if not contents:
                            print 'Null Contents...'
    
                            db.updateURL( url, -1 )
                            return
    
    
                    soup = BeautifulSoup( contents )
    
    
                    gatherNeighborInfo( soup )              
    
    
                    n=0
                    for u in getArticleInfo( soup ):
                            n+=db.insertURL( u )
                    if n>0: print 'inserted %d urls from %s'%(n,url)
    
    
                    sp = contents.find('<title>')
                    if sp>-1:
                            ep = contents[sp+7:].find('</title>')
                            title = contents[sp+7:sp+ep+7]
                    else:
                            title = ''
    
    
                    contents = getBody( soup, newURL )  
    
    
                    db.updateURL( url , contents )
    
            else:
                    print 'Invalid blog article...'
    
                    db.updateURL( url, 'None', -1 )
    
    class DB:
            "MySQL wrapper class"
            def __init__(self):
                    self.conn = MySQLdb.connect(db='crawlDB', user='root', passwd='qltkd')
                    self.conn.query("set character_set_connection=utf8;")
                    self.conn.query("set character_set_server=utf8;")
                    self.conn.query("set character_set_client=utf8;")
                    self.conn.query("set character_set_results=utf8;")
                    self.conn.query("set character_set_database=utf8;")
                    self.cursor = self.conn.cursor()
                    self.cursor.execute('CREATE TABLE IF NOT EXISTS urls(url CHAR(150), state INT, content TEXT)')
            def commit(self):
                    self.conn.commit()
            def __del__(self):
                    self.conn.commit()
                    self.cursor.close()
    
            def insertURL(self, url, state=0, content=None):
                    #'/' delete
                    if url[-1]=='/': url=url[:-1]
                    try:    
                            self.cursor.execute("INSERT INTO urls VALUES ('%s',%d,'%s')"%(url,state,content))
                    except:
                            return 0
                    else:
                            return 1
    
            def selectUncrawledURL(self):
                    self.cursor.execute("SELECT * FROM urls where state=0")
                    return [ row[0] for row in self.cursor.fetchall() ]
    
            def updateURL(self, url, content, state=1):
                    if url[-1]=='/': url=url[:-1]
            self.cursor.execute("UPDATE urls SET state=%d,content='%s' WHERE url='%s'"%(state,content,url))
    
            def isCrawledURL(self, url):
                    if url[-1]=='/': url=url[:-1]
                    self.cursor.execute("SELECT COUNT(*) FROM urls WHERE url='%s' AND state=1"%url)
                    ret = self.cursor.fetchone()
                    return ret[0]
    
    db = DB()
    
    if __name__=='__main__':
            print 'starting crawl.py...'
    
    
            contents = getContent( mainpage )
            URLs = getArticleInfo( BeautifulSoup( contents ) )
            nSuccess = 0
            for u in URLs:
                    nSuccess += db.insertURL( u ) 
            print 'inserted %d new pages.'%nSuccess
    
    
            while 1:
                    uncrawled_urls = db.selectUncrawledURL()
                    if not uncrawled_urls: break
                    for u in uncrawled_urls: 
    
                            print 'downloading %s'%u
                            try:
                                    parseArticle( u )
                            except:
                                    traceback.print_exc()
                                    db.updateURL( u, -1 )
                            db.commit()
                    #bs.UpdateIndex()
    
    • Andreas Jung
      Andreas Jung over 10 years
      What is the actual full SQL statement here? A string quoting problem?
    • Moon Taejin
      Moon Taejin over 10 years
      I don't know, just ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \'\xeb\x8f\x8b\xec\x9b\x80\', dotum, sans-serif; }\r\n\t//--&gt;\r\n\t</style>\n<p style="TEXT-ALIGN: center\' at line 1'), and the other errors are much the same except for the HTML tag part of the error message.
    • Andreas Jung
      Andreas Jung over 10 years
      You don't know? It is your code, not ours; please provide the requested information. Adding a simple debug print statement is straightforward. If you cannot provide this little piece of information, please avoid programming.
    • Mad Dog Tannen
      Mad Dog Tannen over 10 years
      I think the last \ of the string causes problems. Or you have a single quote (') somewhere that breaks the syntax.
  • Moon Taejin
    Moon Taejin over 10 years
    Wow!!!! Thank you! It works now! This error took a lot of my time. But mccakici, could you please explain what was wrong and what I didn't know? If you do, it will be a great pleasure. Thanks.
  • mccakici
    mccakici over 10 years
    Your problem is the single quote. For example: "UPDATE urls SET state=%d,content='bla' vs vs' ..........". If you use escape_string, it becomes "UPDATE urls SET state=%d,content='bla\' vs vs' ..........". (See the sketch below.)
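
To see concretely why the stray quote produces error 1064: the crawled HTML contains single quotes (for example around the font name 돋움 visible in the error message), and when that text is interpolated into the SQL string with %s, the first quote closes the content='...' literal early. A minimal, self-contained sketch with illustrative values (not taken from the actual crawl data):

    # -*- coding: utf-8 -*-
    # Illustrative values only
    content = "font-family: '돋움', dotum, sans-serif;"   # crawled HTML containing single quotes
    url = 'http://blog.daum.net/example/123'

    # Broken: the quote inside content closes the SQL string literal early,
    # so MySQL sees a truncated statement and reports error 1064
    query = "UPDATE urls SET state=%d,content='%s' WHERE url='%s'" % (1, content, url)
    print query

    # Fix 1: escape the value before interpolating it, e.g. conn.escape_string(content)
    # Fix 2 (preferred): let MySQLdb quote the values itself:
    #   cursor.execute("UPDATE urls SET state=%s,content=%s WHERE url=%s", (1, content, url))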