File API: direct access to a file

by Tomasz Muras -
Number of replies: 6

Hello devs,

Currently there is no public file API function to get the filepath of a local file stored in Moodle data; path_from_hash is private.

I need access to the filepath for my Global Search work (see MDL-31989). Having a file handle is not enough, because I will be processing files with external utilities (e.g. to extract text from PDF).

I know there are some strong opinions here that path_from_hash should stay private. I don't agree, as this is not protecting anything: from any Moodle code you have full rights to delete the whole Moodle data directory if you wish. You can also easily calculate where the real file is.

So the way I see it we have two options to get that for Global Search:

  • I will introduce some code duplication and calculate the path myself
  • we will make that function public

Am I missing anything? I don't really have a strong preference one way or another, I would just like us to reach a conclusion, so I can finish the Global Search work without any surprises.

cheers,
Tomek

In reply to Tomasz Muras

Re: File API: direct access to a file

by Tim Hunt -

Files in the file pool are stored in a content-addressed way. If you change the contents of a file in the pool, then the filename will be wrong, and you will have fundamentally broken things. Therefore, the API only provides read-only ways to get the contents of a file (stored_file::get_content_file_handle, ::readfile, ::get_content, ::copy_content_to, ::copy_content_to_temp, ...). That makes perfect sense to me.
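To make the content-addressing point concrete, here is a minimal Python sketch (not Moodle code). It assumes a pool in which each file is named after the SHA-1 hash of its contents and sharded into two directory levels; the names `pool_path`, `add_to_pool` and `is_consistent` are hypothetical and only serve the illustration.

```python
import hashlib
from pathlib import Path


def pool_path(pool_root: Path, content: bytes) -> Path:
    """Derive the storage path from the SHA-1 of the content (assumed layout)."""
    digest = hashlib.sha1(content).hexdigest()
    # Assumption: two levels of sharding taken from the start of the hash.
    return pool_root / digest[:2] / digest[2:4] / digest


def add_to_pool(pool_root: Path, content: bytes) -> Path:
    """Store content under its hash; identical content is stored only once."""
    path = pool_path(pool_root, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(content)
    return path


def is_consistent(path: Path) -> bool:
    """Check that the file name still matches the hash of the file contents."""
    return hashlib.sha1(path.read_bytes()).hexdigest() == path.name
```

In such a scheme, writing different bytes to a stored file makes `is_consistent` return False, because the name no longer matches the contents. That is the breakage a read-only API is meant to discourage.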

I seem to remember that when you raised this before, the real problem was that the Zend Lucene API could only work with file names (e.g. http://framework.zend.com/apidoc/1.11/db_Zend_Search_Lucene_Document_Pptx.html#%5CZend_Search_Lucene_Document_Pptx), rather than any more abstract way to refer to the stream of bytes that make up the file's contents. That seems like a fairly obvious API flaw in Zend. Have you tried raising this with the Zend Lucene library maintainers?

In reply to Tim Hunt

Re: File API: direct access to a file

by Tomasz Muras -

Just because I need a full path to the file doesn't mean I want to change it - I want to access it in a read-only way. Also bear in mind that by exposing only read-only file access APIs you are not protecting any files - nothing stops a developer from changing them anyway, if they want to do that.

The Zend Lucene API is not a problem - I could handle that with a file handle only. I'm more concerned about reading text from PDF files. The best utility I'm aware of is the command-line pdftotext. I doubt there will be a decent replacement for it written in PHP.

cheers,
Tomek

In reply to Tomasz Muras

Re: File API: direct access to a file

by Tim Hunt -

Well, Zend Lucene seems to work by using a whole lot of classes like Zend_Search_Lucene_Document_Pptx. Why don't they have a Zend_Search_Lucene_Document_Pdf? Surely the right thing to do is to implement that class, and contribute it to Zend?

And all those Zend_Search_Lucene_Document_* classes should take a higher-level abstraction than $path in their constructor. A path is one acceptable way to refer to the file contents to index, but the read-only file handle returned from the file API should be equally acceptable.

In reply to Tim Hunt

Re: File API: direct access to a file

by Tomasz Muras -

You're suggesting rewriting text extraction from scratch instead of using already existing tools. Without even considering whether it's the right thing or not - just think about the time & effort required to do that.

I have only mentioned extraction from PDFs, but then you would need to do the same for .doc, .ppt, .xls (again, command-line tools to process them exist).

Forget about Zend_Search_Lucene_Document_* - my issue is not related to those.

cheers,
Tomek

In reply to Tomasz Muras

Re: File API: direct access to a file

by Tim Hunt -

Well, OK, so you want to use command-line tools for the text extraction, and my attempt to look at the Lucene docs just confused me when I saw the Zend_Search_Lucene_Document_* classes.

Anyway, running a command-line program does not require a file path. The whole point about processes is that stdout and stdin are streams. It is perfectly possible to use the read-only file handle you get from the file API to send data to a command-line process. See, for example, https://github.com/maths/moodle-qtype_stack/blob/master/stack/cas/connector.unix.class.php#L56 and https://github.com/maths/moodle-qtype_stack/blob/master/stack/cas/connector.windows.class.php#L39. Not sure we even really need those separate implementations for Windows and *nix any more. That might be historical.
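As a rough illustration of the idea (in Python rather than PHP, and not taken from the linked STACK code), an already-open read-only file handle can be attached to a child process as its stdin, so the child never learns a path. The helper name `run_with_stdin` is hypothetical, and the sketch assumes the external tool really does read its input from stdin.

```python
import subprocess
import sys
from typing import BinaryIO, Sequence


def run_with_stdin(command: Sequence[str], source: BinaryIO) -> bytes:
    """Run command with source attached as its stdin and return its stdout.

    source must be a real, freshly opened OS-level file (it needs a file
    descriptor); the child process reads from it directly.
    """
    result = subprocess.run(
        command,
        stdin=source,            # the child reads the file through this handle
        stdout=subprocess.PIPE,  # capture whatever the tool prints
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

For example, with a stand-in "tool" that upper-cases its input, `run_with_stdin([sys.executable, "-c", "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"], handle)` returns the upper-cased bytes of whatever file `handle` was opened on.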

In reply to Tim Hunt

Re: File API: direct access to a file

by Tomasz Muras -

It depends on whether the command-line program supports "-" as a file argument and/or reads from stdin - but it's definitely a good idea worth checking.

Thanks Tim,
Tomek
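
For a tool that insists on a real path, one fallback that still leaves the pool untouched is to copy the contents to a temporary file first, in the spirit of the ::copy_content_to_temp method mentioned earlier in the thread. The Python sketch below is only an illustration under that assumption; `run_with_temp_copy` is a hypothetical helper, and it assumes the tool takes the input path as its last argument.

```python
import os
import shutil
import subprocess
import tempfile
from typing import BinaryIO, Sequence


def run_with_temp_copy(command_prefix: Sequence[str], source: BinaryIO,
                       suffix: str = "") -> bytes:
    """Copy source to a temp file, run the tool on that path, return stdout."""
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    try:
        # Stream the read-only handle into the temporary copy.
        with os.fdopen(fd, "wb") as temp_file:
            shutil.copyfileobj(source, temp_file)
        # Assumption: the tool expects the input path as its final argument.
        result = subprocess.run(
            [*command_prefix, temp_path],
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout
    finally:
        # The copy is disposable; the original in the pool is never exposed.
        os.remove(temp_path)
```

The cost is one extra copy per file, which may matter when indexing many large documents, so streaming via stdin is preferable where the tool allows it.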