• Birliban (unregistered)

    The Article from the future! 2017-10-13

  • MiserableOldGit (unregistered)

    Looks like a junior coder error ... TRWTF is how this got deployed like that. Don't they test anything there?

  • Pjrz (unregistered)

    Well...it does use a buffer to read!

    Sounds like someone had heard of buffering, or had been told to use buffering, but didn't quite understand the real purpose of it.
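    To make the point concrete, here is a minimal sketch (names and the 4 kB chunk size are my own) of what buffering is actually for: reuse one small, fixed-size buffer and stream chunks through, so memory stays bounded no matter how large the input is.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// Stream from `in` to `out` through one reused 4 kB buffer.
// Memory use is constant regardless of the input's size.
bool copy_stream(std::istream& in, std::ostream& out) {
    char buffer[4096];                                  // one small, reused buffer
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        out.write(buffer, in.gcount());                 // write only what was read
    }
    return in.eof() && out.good();                      // clean EOF, healthy sink
}
```

    Works the same whether the source is a 10-byte string or a 10 GB file; only the buffer is ever resident.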

  • Uhm (unregistered)

    I recently ran into this problem: the platform I'm developing for has a JSON library that doesn't work on a buffer, but only on a string in memory.

    I begrudgingly coded a function like this that loads the entire file into memory, in the hope that it never exceeds 4 kB or so...
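    For the record, a function "like this" can at least be written cleanly. A hypothetical minimal slurp (file name and error handling are illustrative), with the same caveat that it holds the whole file in memory:

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read an entire file into a std::string in one go.
// Fine for small config/JSON files; dangerous for arbitrarily large input.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream contents;
    contents << in.rdbuf();    // single streamed read of the whole file
    return contents.str();
}
```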

  • Quite (unregistered) in reply to Uhm

    A-ha-ha, the old "4kB" implementation ...

    I worked on an application once which was an internal tool for configuring our complex and baroque customer systems, which were unwieldy enough as to need that tool to be heartstoppingly complicated. The design was ingenious and, on the surface, sound, and it interfaced with a SQL database which was about as good as it could have been, given the circumstances.

    Except for one aspect: the modification history for the specific system configurations was held in a single field in a SQL table limited to 4000 characters. "4000 characters will be enough for anyone, surely?" the thinking must have been. But after a couple of years of complicated year-end client modification requests (each year brought in a new product line, for example), the mod history became, er yes that's right, longer than 4000 characters.

    The solution was obviously to amend the table to implement something more sophisticated than the 4000 characters. But this was unfeasible because of how unwieldy our software had become. So the actual solution was "if it crashes, go into mod history tool and remove the first however-many lines so as to allow room for the new mod history." Ugh.

    The fact that the app was written using EJB and ran on a combined Tomcat 4/JBoss 4 enterprise server, which had steadfastly resisted every attempt to upgrade it, meant we couldn't even move forward to Java 5.

  • Jerry Leichter (unregistered)

    It's worse than the description makes clear. Every time a buffer is appended to the growing string object, the underlying storage for the string object has to grow. To grow it, a new, larger string object has to be allocated, the existing data copied, and the previous one freed. Until that free, there are two copies of the previous version of the object.

    String implementations can pre-allocate extra space to avoid the overhead of copying on every append, but at some point they have to reallocate. In the worst case, the last byte of data comes in all by itself just as the pre-allocated space in the string object is exactly full - at which point you briefly need double the size of the file.

    (For completeness: Some implementations of realloc() - assuming that's what the string implementation uses under the covers - notice when they are extending beyond the end of currently used memory and increase the size of a memory block without allocating a whole new one. Implementations like this have probably become less common over the years because in programs written with a modern style - and particularly in multi-threaded programs - the chance of ever actually being able to apply this optimization in real code is very small. It just might apply here, by dumb luck.)
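    The growth cost described above is exactly what std::string::reserve() exists to avoid. A small sketch (the size is arbitrary) showing that a single up-front reservation lets repeated appends proceed without any reallocation:

```cpp
#include <cassert>
#include <string>

// Append n characters after a single reserve(); the buffer address is
// checked to confirm no reallocation (and hence no copying) happened.
std::string build_with_reserve(std::size_t n) {
    std::string s;
    s.reserve(n);                  // one allocation up front
    const char* p0 = s.data();
    for (std::size_t i = 0; i < n; ++i) s.push_back('x');
    assert(s.data() == p0);        // still the same buffer: no reallocation
    return s;
}
```

    Of course, this only works when the final size is known (or boundable) in advance - which, for reading a file whose size you can stat, it is.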

  • RLB (unregistered)

    Also, THWAP. Don't Use malloc() In C++.

  • Appalled (unregistered)

    Robocopy to the rescue.

  • (nodebb) in reply to Jerry Leichter

    "realloc() - assuming that's what the string implementation uses under the covers"

    Bad programmer. Bad. Twhap, newspaper to the nose. (Not my weapon of preference, as long-standing visitors will know, but it will suffice here.)

    Why?

    1. Don't discuss malloc/calloc/free, nor realloc(1), in connection with C++, except to say not to use them.

    2. Don't assume that the C++ STL uses the C allocator behind the scenes. It's allowed to, but you must never base anything on such an assumption.

    (1) Remember that realloc functions as a complete allocator all by itself. realloc(NULL, size) is equivalent to malloc(size). realloc(p,0) is equivalent to free(p) except that realloc(NULL,0) is UB because it isn't clear whether it is malloc(0) or free(NULL).
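    The footnote's first claim is easy to demonstrate. This snippet (shown purely to illustrate the C semantics; as the thread says, don't use these allocators in real C++ code) exercises realloc(NULL, n) as a stand-in for malloc(n):

```cpp
#include <cstdlib>

// realloc with a null pointer behaves exactly like malloc: it allocates
// a fresh block. We write and read a value through it to prove the
// allocation is usable, then release it.
int demo_realloc() {
    int* p = static_cast<int*>(std::realloc(nullptr, sizeof(int)));
    if (p == nullptr) return -1;   // allocation failure
    *p = 42;
    int v = *p;
    std::free(p);   // realloc(p, 0) would also release it, but free() is clearer
    return v;
}
```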

  • Kanitatlan (unregistered)

    I once had to sort out an application whose purpose was to generate a formatted CSV archive of database data on a daily basis. It was supposed to run daily, but if it didn't, it was at least set up to deal with the backlog (all in one go, though). Unfortunately the programmer in question had used the following algorithm:

    1. Run an ADO query to collect all the records - not read-only and not unidirectional, so all the data got cached
    2. Create a complete copy of all the data as an object representing the "read" operation
    3. Copy that into another object representing the "enterprise" management object
    4. Copy that into another object for output
    5. Write to the CSV
    6. Repeat for all the other tables covered, making sure that nothing is ever closed or dropped

    One day it broke, and nobody noticed for a month, with only too predictable consequences.

    Needless to say, it received a rewrite to use a unidirectional query and stream directly to the output file, along with removal of all the "enterprise" standard management structure. About 90% less code, apart from anything else. I would have submitted it as a WTF, but being all enterprisey, no one bit of the code explained anything of the almighty WTF it comprised in total.

  • Christopher Jefferson (google)

    This might be a WTF, but it's also very common behaviour. Git does exactly the same thing, meaning you can create repositories on a 64-bit system that don't work on a 32-bit system, because you can't load the whole file into memory at once. I think on 32-bit Macs you couldn't have a file bigger than 400 MB or so, as that was the largest block of contiguous memory the OS would give you.

  • OzPeyetr (unregistered)

    Minor additional WTF ... but a personal gripe with C-based languages. The asterisk and the ampersand in the variable declarations are really bound to the data type and not to the variable name. E.g. it is const char* filename and std::string& result, and not const char *filename and std::string &result. C-style languages just let you get away with it. And then you have things like char * buffer, where the coder can't commit one way or the other.

  • Mr Bits (unregistered) in reply to OzPeyetr

    char* foo, bar;

    Explain what the variables foo and bar are. Then explain to us again how the * is bound to the data type.
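    For anyone following along, the answer the question is fishing for can be checked by the compiler itself. In this declaration the * binds to foo only, so the two variables get different types:

```cpp
#include <type_traits>

// What "char* foo, bar;" actually declares:
char* foo, bar;   // foo is char*, bar is plain char

static_assert(std::is_same<decltype(foo), char*>::value, "foo is a pointer");
static_assert(std::is_same<decltype(bar), char>::value,  "bar is a char");
```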

  • OzPeyetr (unregistered) in reply to Mr Bits

    Congratulations ... you just pointed out an inconsistency in C-based languages. While * and & are type modifiers that regrettably bind to the variable name, they are still type modifiers.

  • Gummy Gus (unregistered)

    Not to mention that the read loop is basically WRONG: it stops reading on EOF or error and doesn't differentiate between the two.
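    A hypothetical sketch of what a correct termination check looks like: after the loop stops, inspect the stream state to tell a clean end-of-file apart from a genuine read error, instead of treating both the same.

```cpp
#include <istream>
#include <sstream>

enum class ReadResult { Eof, Error };

// Consume a stream in fixed-size chunks, then report WHY reading stopped.
ReadResult drain(std::istream& in) {
    char buf[4096];
    while (in.read(buf, sizeof buf)) {
        // process full chunks here; in.gcount() bytes of buf are valid
    }
    // read() stopped: a short final chunk (in.gcount() bytes) may remain.
    // Check the state before treating the data as complete.
    return in.eof() ? ReadResult::Eof : ReadResult::Error;
}
```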

  • (nodebb) in reply to OzPeyetr

    There's an alternative way to read them, of course.

    Rather than saying "there is a pointer to int and it is called frobble", read it as "there is an int and it is reached by an expression like '*frobble'." That way it becomes natural for the star to bind to the variable. (The star and other decoration indicate the route to the type, not the type itself.)

    And the "star bound to datatype" only works for simple star and ampersand.

    Would you write "pointer to array" as: int (*p)[SIZE] ?

    Heck, even "array of pointers" looks odd: int* arr[SIZE], to say nothing of a function pointer: int (*p)(int x, int y, char zz, bool top) (most of the declaration appears after the name of the variable).

    C-style declarations are what they are, and it is 40+ years too late to change them.
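    That said, modern C++ does offer a workaround: type aliases restore the "type on the left" reading even for the awkward declarators above. All names here are illustrative:

```cpp
#include <cstddef>
#include <type_traits>

constexpr std::size_t SIZE = 4;

int (*p_arr)[SIZE];              // pointer to an array of SIZE ints
int* arr_p[SIZE];                // array of SIZE pointers to int
int (*fn)(int, int);             // pointer to a function taking two ints

// Aliases let the whole type sit to the left of the variable name again:
using IntArray = int[SIZE];
using BinOp    = int (*)(int, int);

IntArray* p_arr2 = nullptr;      // same type as p_arr, reads left to right
BinOp     fn2    = nullptr;      // same type as fn

static_assert(std::is_same<decltype(p_arr), decltype(p_arr2)>::value, "same type");
static_assert(std::is_same<decltype(fn), decltype(fn2)>::value, "same type");
```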

  • Ulysses (unregistered)

    Despite the breakdown with fancier cases, I'm with Oz. Where it's more natural, keep the type info with the type.

    Back to the article. The OP's unholy example of malloc() reminds me of a useful technique I employ in the opposite scenario. If, say, WinAPI needs me to provide a variable output buffer, I supply a (p)resized std::string and unconst the value of data(). I benefit from automatic destruction as well as the small string optimization. Boom.
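    A sketch of that technique, with a stand-in for the real API (fill_api here is hypothetical, standing in for something like a WinAPI call that fills a caller-supplied char buffer and returns the length written):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style API: fills buf (capacity cap, including the
// terminator) and returns the number of bytes written.
std::size_t fill_api(char* buf, std::size_t cap) {
    const char msg[] = "hello";
    std::size_t n = std::min(cap - 1, sizeof msg - 1);
    std::memcpy(buf, msg, n);
    buf[n] = '\0';
    return n;
}

// Pre-size a std::string, hand its (contiguous) buffer to the C API,
// then trim to the bytes actually produced. Destruction is automatic.
std::string call_with_string_buffer() {
    std::string s(256, '\0');                   // pre-sized output buffer
    std::size_t n = fill_api(&s[0], s.size());  // &s[0] is writable
    s.resize(n);                                // keep only what was written
    return s;
}
```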

  • Uhm (unregistered) in reply to Mr Bits

    char* foo, bar;

    Explain what the variables foo and bar are. Then explain to us again how the * is bound to the data type.

    Really, that's the problem with this construct, and why it's a deeper language-design problem rather than a mere "style" question of whether * should have a space before or after it.

    In "normal" languages you have "foo, bar : POINTER TO CHAR", and it's really clear and obvious what is meant.

  • isthisunique (unregistered)

    This isn't strictly a WTF. When memory was limited, this would have been problematic. Today memory is cheap, and you can get away with things like this. Streaming isn't exactly free to implement, or always viable. These days, whether something like this is a problem has to be taken on a case-by-case basis.

  • snoofle (unregistered) in reply to isthisunique

    It's not so much that it's not a WTF as it is a comment on the state of education and training of people who learned to program using anything that auto-GCs; they don't appreciate what's going on under the hood.

    About 12 years ago, I ended up pulling a 30 hour Sunday debugging session to figure out why some server-app was getting hosed and spinning off the deep end at random intervals. The person responsible was out for a couple of days and had (for unrelated reasons) changed the password to source control, so we had to debug by using trace, and logically step through the code. Long story short, the coder made the rookie mistake of assuming that a read from a socket would always get the full message (requested number of bytes), and just coded a single socket-read instead of looping until the entire message had been read. A Friday night OS patch had exposed the bug. By the time the guy got in on Monday morning, we told him the PROPER way to code it.
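    The proper way, for reference, is the standard receive loop: keep calling the socket read until the full message length has arrived, retrying on EINTR and bailing out on close or error. A minimal POSIX sketch (error handling deliberately spare):

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

// Read exactly len bytes from fd into buf, looping over short reads.
// Returns false if the peer closes early or a real error occurs.
bool recv_all(int fd, char* buf, std::size_t len) {
    std::size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n > 0)       got += static_cast<std::size_t>(n);   // partial read: keep going
        else if (n == 0) return false;                         // peer closed early
        else if (errno != EINTR) return false;                 // real error (EINTR: retry)
    }
    return true;
}
```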

  • Sole Purpose of Visit (unregistered)

    I hate to be picky here, but absolutely no part of this wretched "solution" can be described as C++. Unless, of course, you wish to think of bog-standard C as "degenerate C++."

    I dunno, kids these days ...

  • Anonymous (unregistered)

    And where does the star belong when you write char const * const p?

    For me it depends on the context.

    char const* p or char const *p?

    Read from the right side and all is clear, even if it seems inconsistent.

    p is a pointer to const char, or p is a const pointer to char. Easy.
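    The right-to-left reading, spelled out (variable names are illustrative):

```cpp
char g = 'x';

// Read each declaration from the right toward the type:
char const* pc        = &g;   // pc is a pointer to a const char: *pc can't be modified
char* const cp        = &g;   // cp is a const pointer to char: cp can't be reseated
char const* const cpc = &g;   // cpc is a const pointer to a const char: neither can change
```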

  • bvs23bkv33 (unregistered)

    I was parsing a 1.5 MB XML file and inserting it into a MySQL database, and the database was running on a 1.6 GHz Celeron, so uploading the whole file before communicating with the database was the only way to avoid curl timeouts.

  • Greg (unregistered) in reply to Ulysses

    You'd be safer with a vector: there's no guarantee a string will be allocated as one big chunk of memory (although in most cases it will), while that guarantee exists for vector (where you can then use the address of the first element to access the underlying memory chunk: &v[0]).
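    The &v[0] idiom looks like this in a minimal sketch (sizes and contents are arbitrary): a vector's storage is contiguous by guarantee, so a C-style API can write straight into it.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Create a zero-initialised, contiguous buffer and write into it
// through a raw pointer, as a C API would.
std::vector<char> make_buffer() {
    std::vector<char> v(16);          // 16 zero bytes, contiguous storage
    std::memcpy(&v[0], "hi", 3);      // includes the terminating '\0'
    return v;
}
```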

  • (nodebb) in reply to Greg

    I'm pretty sure that std::string does allocate a contiguous buffer for the string, so that its operator[] can do something sane. The main differences between std::string and std::vector are approximately:

    - std::basic_string<T, ...> assumes that the element type can have a value of zero, so that the c_str() method can work, while std::vector does not need to make that assumption.(1)
    - The above also permits a std::basic_string<T,...> to be initialised from a single T *, where it will read up to (but not including) the first zero it encounters. This doesn't work for std::vector.
    - To keep up with the implication that the string is, well, a string, std::basic_string<T,...> supports binary operator + as a concatenation operator, as well as some substring and search operations that aren't in std::vector.

    (1) T does not need to be an integer type, but must be "compatible" with 0 - that is, constructing a T with a single parameter of zero must be possible, and the result must compare equal to 0. A std::basic_string<T,...> can contain embedded "zeroes", although they will make c_str() do ... strange things. (Not UB, but not necessarily what you expect either, although it is what you should expect.)

  • Greg (unregistered) in reply to Steve_The_Cynic

    I looked some stuff up in the spec, and apparently as of C++17 strings are guaranteed to be contiguous memory chunks as well (vector has been contiguous since C++03, btw). Truth be told, I've never encountered a string implementation that wasn't contiguous, but when writing code that might get ported to other platforms and/or compilers, I prefer not to make any assumptions.

  • Kanitatlan (unregistered) in reply to snoofle

    My lot have, several times, compounded the rookie mistake of assuming every socket read returns a whole message with the rookie mistake of assuming every socket read is precisely one whole message, even if you simply ask for everything in the buffer.

  • foxyshadis (unregistered) in reply to Kanitatlan

    The whole "read the entire (joined) table and filter it at your leisure on the front-end" is my favorite database antipattern, because it's always the first one I see when I get to a new site. SQL can be a pain in the ass, but it's not THAT hard to learn the basics and keep from having to loop over a million records every fetch!

  • Ulysses (unregistered) in reply to Greg

    Safer on an academic's whiteboard, maybe. As has already been alluded to, how would c_str()/data() ever work if the implementation resembled a typical std::deque? The C++17 guarantee is merely an overdue formality.

  • siciac (unregistered) in reply to snoofle

    It's not so much that it's not a wtf, as a comment on the state of education and training of people who learned to program using anything that auto-gc's; they don't appreciate what's going on under the hood.

    Auto-gc was key to the rise of agile languages like Javascript, and that has led to code that is a mess of "data pasta."

    And while I can think of one guy who was a counterexample to this, generally my experience has been that people who don't understand memory allocation aren't producing nicely structured code, nor do they make use of classes. They're generally copying crap from SO.

    To my jaded eye, as programming is becoming more accessible, we just have more coders that suck at coding.

  • siciac (unregistered) in reply to Ulysses

    As has already been alluded, how would c_str()/data() ever work if the implementation resembled a typical std::deque? The C++17 guarantee is merely an overdue formality.

    It could allocate a copy that is freed when the string's destructor is called. (Which would keep it around longer than you might want.) If string used a rope implementation, creating the C string lazily could make sense. The trouble is, the rest of the API doesn't lend itself to that, and ropes aren't a good drop-in replacement for strings. I suspect they found it's surprisingly hard to improve on a simple array of characters for representing a string.

  • isthisunique (unregistered)

    I said before that this isn't a WTF but didn't clarify that point. I was of course only referring to the read-all, write-all behaviour. I had overlooked that there might not be any processing (it could be a move or copy). This is an implementation issue that comes into play whether or not the I/O in the application layer is necessary.

    The reason, beyond convenience and simplicity, that you might want to do this rather than streaming input and output - assuming you have enough memory - is fault tolerance. In such a case, doing one pass at a time can yield better results. It is also common to see input streamed into processing but buffered for writing, with the write happening only on successful completion of all previous stages.

    I've written several systems like this that look strange up front but have the specific intention of mitigating the impact of faults. A common example might be where you need to make several set calls to an API. If you don't have transactions or anything like that and you have to make multiple calls, you'll want to minimise where an operation can be interrupted and ensure that it can be replayed. If you have an error during reading and processing, chances are you won't know what it is. If you coded defensively and something hits a trap, there's nothing you can do: something unexpected happened and the program must stop. In a lot of cases you can't eliminate breakages, but you can manage their impact.

    Some languages also do not lend themselves well to asynchronous I/O. C++ shouldn't be too bad, though, especially with a number of modern libraries that make it a breeze.
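    One common way to get the "buffered write, committed only on success" behaviour described above is write-then-rename: write everything to a temporary file, and only rename it over the target once the write completed cleanly. A sketch (paths are hypothetical; on POSIX the rename is atomic):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Write data to path via a temp file, committing with an atomic rename.
// An interrupted run leaves the old file untouched.
bool write_atomically(const std::string& path, const std::string& data) {
    std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary);
        out << data;
        if (!out) return false;       // write failed: nothing committed
    }                                  // stream flushed and closed here
    return std::rename(tmp.c_str(), path.c_str()) == 0;   // the commit point
}
```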

  • Patrick (unregistered) in reply to Steve_The_Cynic

    "malloc/realloc/free" can be accurately used to refer to the basic functions of a dynamic memory allocator, as opposed to a specific implementation.

    Also, while you can theoretically stay in your C++ la-la-land without ever invoking a raw memory allocator directly, that bubble is gonna burst as soon as you have to interact with external libraries. Even within a single large project (obvious example: any major web browser) you can frequently encounter things like having to deal with several different allocators/heaps with different rules and responsibilities for using, freeing and passing around pointers to their memory. Not to mention kernel code. There's a reason C++ has things like placement new, you know...

    This is what's so annoying with C++ fanboys. They seem to be collectively unable to reason at several levels of abstraction simultaneously. Maybe that part of the brain has to go to be able to fit all the esoteric C++ trivia they all seem to be so fond of.

  • Patrick (unregistered) in reply to Patrick

    PS. STL itself lets you specify which allocator to use. So much for not ever having to make assumptions about how it allocates memory, or being a bad coder for having to do it. In certain cases you even have to adapt your allocation patterns to the specific allocator in use for performance/memory use reasons, or use a custom allocator that's optimized for your allocation patterns.

    Abstractions leak; learn to deal with it.

  • (nodebb) in reply to siciac

    I believe Netscape 6's rendering engine used to use ropes, but they switched to flat strings for Netscape 7.

    On the other hand, I think Firefox's JavaScript engine switched from flat strings to ropes at some point...

Leave a comment on “RAM On Through”
