Opened 8 years ago
Closed 8 years ago
#1441 closed defect (fixed)
random systemtest failure
Reported by: | Dimitar Misev | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 9.4 |
Component: | rasnet | Version: | development |
Keywords: | Cc: | Peter Baumann, Vlad Merticariu, Bang Pham Huu | |
Complexity: | Medium |
Description
It seems like a completely random query every now and then fails in the systemtest. It's happened with a both update and select queries. The output from rasql is:
rasdaman error 0: The client failed to contact the server.
and there is no trace of the query in the server logs, so it looks like a random rasnet issue.
Attachments (1)
Change History (7)
comment:1 by , 8 years ago
by , 8 years ago
Attachment: | scaling_2.rasql.out added |
---|
comment:2 by , 8 years ago
I think this error happens when the query is sent right at the time when a rasserver is being restarted (or that is one of the conditions):
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.5981: Request: 'select scale( test_grey, [ 4, 3 ] ) from test_grey'... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6004: parsing... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6085: checking semantics... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6168: evaluating... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.8343: ok, result type 'set <marray <char, [0:1023,0:632]>>', 1 element(s), total size 648192 bytes. rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.55990: Request: 'select c from RAS_COLLECTIONNAMES as c'... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56054: parsing... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56215: checking semantics... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56578: evaluating... rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56609: ok, result type 'set <marray <char, [0:9]>>', 6 element(s), total size 60 bytes. rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.75735: Exiting server process. # rasql attempts to execute the query in this timeframe rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140767: rasserver: rasdaman server v9.3.1-gfc2a3ec on base DBMS sqlite. rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140852: Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009 Peter Baumann rasdaman GmbH. rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140893: To obtain a list of external packages used, please visit www.rasdaman.org. rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140927: Server 8b0d3189-00ae-48ee-af6a-da6342431950 of type rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140962: RPC, registered with prognum 0x1b5a rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140992: , connecting to sqlite as '/opt/rasdaman/data//RASBASE' rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141019: Verifying rasmgr host name: localhost... rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141224: ok rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141257: Tile size set to : 4194304 rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141280: PCTMin set to : 2048 rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141300: PCTMax set to : 4096 rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141318: IndexSize set to : 0 rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141342: Default Tile Conf: [0:1023,0:1023] rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141363: Default Tiling : AlignedTiling
So one way to test it is to set a low countdown parameter in rasmgr.conf and run a lot of queries perhaps.
comment:3 by , 8 years ago
Just found out that the -countdown option doesn't have any effect in rasnet actually..
# -countdown n server will automatically restart after n requests # (beware of long-running transactions!) ... define srv N2 -host hifi -type n -port 7003 -dbh rasdaman_host change srv N2 -countdown 0 -autorestart on -xp --timeout 30
comment:4 by , 8 years ago
I submitted a patch that increases the countdown parameter in rasmgr.conf which will hopefully fix these issues with the systemtest.
comment:5 by , 8 years ago
Milestone: | 9.3 → 9.4 |
---|
comment:6 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I think we haven't had this issue anymore.
I'm not yet sure on how to go about reproducing this reliably and fixing it, but as a first step we can fix the systemtest at least:
@Bang: if a query fails, can you add a check on whether the output file is such an error, and retry the query if it is? The error in the output/query.out file is exactly (I attached the file)