For you readers out there that use the ANSYS Remote Solve Manager (RSM) and have had one or all of the below questions, this post might just be for you!
- What actually happens after I submit my job to RSM?
- Where are the files needed to run the solve go?
- How do the files get returned to the client machine, or do they?
- What if something goes wrong with my solve or in the RSM file downloading process, is there any hope of recovery?
- Are there any recommendations out there for how best to use RSM?
If your question is, how do I setup RSM as a user? You answers are here from a post by Ted Harris. The post today is a deeper dive into RSM.
The answers to questions 1 through 3 above are really only necessary if you would like to know the answer to question 4. My reason for giving you a greater understanding of the RSM process is so that you can do a better job of troubleshooting should your RSM job run into an issue. Also, please note that this process is specifically for an RSM job submitted for ANSYS Mechanical. I have not tested this yet for a fluid flow run.
What happens when a job gets submitted to RSM?
The following will answer questions 1-3 above.
When a job is run locally (on your machine), ANSYS uses the Solver Files Directory to store and update data. That folder can be found by right clicking on the Solution branch in the Model tree and selecting Open Solver Files Directory.
The project directory will be opened and you can see all of the existing files stored for your particular solution:
When a job gets submitted to RSM, the files that are stored in the above folder will be transferred to a series of two temporary directories. One temporary directory on the client side (where you launched the job from) and one temporary directory on the compute server side (where the numbers get crunched).
After you hit solve for a remote solve, you will notice that your project solver directory gets emptied. Those files are transferred to a temporary directory under the _ProjectScratch directory:
Next, these files get transferred to a temporary directory on the compute server. The files in the _ProjectScratch directory will remain there but the folder will not be updated again until the solve is interrupted or finished.
You can find the location of the compute server temporary directory by looking at the output log in the RSM queueing interface:
If you navigate to that directory on your compute server, you will see all of the necessary files needed to run. Depending on your IT structure, you may or may not have access to this directory, but it is there.
Here is a graphical overview of the route that your files will experience during the RSM solve process.
Once your run is completed or you have interrupted it to review intermediate results and your results have been downloaded and transferred to the solver files folder, both of the temporary directories get cleaned up and removed. I have just outlined the basic process that goes on behind the scenes when you have submitted a job to RSM.
What if something goes wrong with my RSM job? Can I recover my data and re-read it into Workbench?
Recently, I ran into a problem with one of my RSM jobs that resulted in me losing all of the data that had been generated during a two day run. The exact cause of this problem I haven’t determined but it did force me to dive into the RSM process and discover what I am sharing with you today. By pin-pointing and understanding what goes on after the job is submitted to RSM, I did determine that it can be possible to recover data, but only under certain circumstances and setup.
First, if you have the “Delete Job Files in Working Directory” box checked in the compute server properties menu accessed from the RSM queue interface (see below) and RSM sees your job as being completed, the answer to the above question is no, you will not be able to recover your data. Essentially, because the compute server is cleaned up and the temporary directory gets deleted, the files are lost.
To avoid lost data and prepare for such a catastrophe, my recommendation is that you or your IT department, uncheck the “Delete Job Files in Working Directory” box. That way, you have a backup copy of your files stored on the server that you can delete later when you are sure you have all of your files safely transferred to your solver files folder within your project directory structure.
The downside to having this box unchecked is that you have to manually cleanup your server. Your IT department might not like, or even allow you to do this because it could clutter your server if you do not stay on top of things. But, it could be worth the safety net.
As for getting your data back into Workbench, you will need to manually copy the files on the compute server to your solver files folder in your Workbench project directory structure. I explained how to access this folder at the beginning of this post. Once you have copied those files, back in the Mechanical application, with the Solution branch of your model tree highlighted, selects Tools>Read Results Files… (see below graphic), navigate to your solver files directory, select the *.rst file and read it in.
Once the results file is read in, you should see whatever information is available.
Though it is possible to run concurrent RSM jobs from the same project, my recommendation is to only run one RSM job at a time from the same project in order to avoid communication or licensing holdups
Unless you are confident that you will not ever need to recover files, consider unchecking the “Delete Job Files in Working Directory” box in the compute server properties menu.
Note: if you are not allowed access to your compute server temporary directories, you should probably consult your IT department to get approval for this action.
Caution: if you uncheck this box, be sure that you stay on top cleaning up your compute server once you have your files successfully downloaded
Depending on your network speed, when your results files get large, >15GB, be prepared to wait for upload and download times. There is likely activity, but you might not be able to “see” it in the progress information on the RSM output feed. Be patient or work outside of RSM using a batch MAPDL process.
Avoid hitting the “Interrupt Solution” command more than once. I have not verified this, but I believe this can cause mis-communication between the compute server and local machine temporary directories which can cause RSM to think that there are no files associated with your run to be transferred.